Lecture Notes on Data Curation for Interactive Visualisation
Week 1: The relation between data acquisition, curation, processing, and visualisation
Lecture 1: Introduction
A recommended framework: Open Visualization Environment
A typical data processing pipeline: hypothesis, data acquisition, feature extraction, classification/regression, presentation/visualisation.
- Data
  - Data collection: source, update frequency, quantity, type of data
  - Quality: data quality metrics, noise characterisation
  - Storage: storage capacity, transfer latency, bandwidth, loading/saving
- Extraction
  - Feature encoding/representation
  - Normalisation
  - Augmentation/imputation
- Model (not the focus of this course)
  - Model type: regression, classification, multi-label, multi-task, zero-shot/few-shot
  - Training protocol
  - Objective function
- Visualisation
  - Feature decoding
  - Presentation type
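As an illustration of the extraction stage, the following is a minimal sketch of imputation followed by normalisation. It assumes mean imputation and min-max scaling, which are only two of many possible choices; the function names `impute_mean` and `min_max_normalise` and the sample values are hypothetical and not part of the course material.

```python
import math


def impute_mean(values):
    """Replace missing entries (None or NaN) with the mean of the observed ones."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None or math.isnan(v) else v for v in values]


def min_max_normalise(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative feature column with one missing measurement.
raw = [2.0, None, 4.0, 6.0]
filled = impute_mean(raw)
scaled = min_max_normalise(filled)
print(filled)  # [2.0, 4.0, 4.0, 6.0]
print(scaled)  # [0.0, 0.5, 0.5, 1.0]
```

Imputing before normalising keeps the filled value inside the observed range, so it does not change the minimum or maximum used for scaling.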
Lecture 2: Data quality and curation
1 Data life cycle
Hypothesis → Collection → Analysis → Storage → Dissemination (publication/visualisation) → Archive or Destroy → Hypothesis
Data acquisition checklist
- relevant data that could answer the hypothesis
- a storage model based on the storage requirements and transfer "cost"
- estimated cost (technological, effort, and economic)
- a sufficient amount of data
- generate synthetic data?
- accuracy and verification
- manual annotation
- retention period, update, and versioning policy
Storage requirements
Storage format
- Files? Records? Unstructured vs structured data
- Collect an early sample and extrapolate to the whole set
- Types of access (privacy, licence, authorisation, authentication)
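The idea of sizing storage from an early sample can be sketched as follows. This is a rough estimate under stated assumptions: the sample is representative of the full set, record size is the only cost driver, and `overhead_factor` (a hypothetical parameter covering indexes, metadata, and replication) is a guess that should be replaced with a measured value.

```python
def estimate_total_bytes(sample_sizes_bytes, expected_records, overhead_factor=1.2):
    """Extrapolate total storage from the mean record size of an early sample."""
    if not sample_sizes_bytes:
        raise ValueError("need at least one sampled record")
    mean_size = sum(sample_sizes_bytes) / len(sample_sizes_bytes)
    return mean_size * expected_records * overhead_factor


# Illustrative numbers: four sampled records, one million records expected.
sample = [1200, 800, 1000, 1000]  # sizes in bytes
estimate = estimate_total_bytes(sample, expected_records=1_000_000)
print(f"about {estimate / 1e9:.1f} GB")  # about 1.2 GB
```

If record sizes vary widely, a larger sample (or a percentile rather than the mean) gives a safer upper bound.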
Sampling frequency
- the frequency of sampling the data
- update frequency of the source data
Storage technology
- File repository (network, local file systems)
- SQL databases (e.g. Postgres, MySQL), time-series/column-oriented databases (e.g. InfluxDB), NoSQL databases (e.g. MongoDB), search indexes (e.g. Elasticsearch)
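To make the SQL option concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is chosen only because it needs no server; it is not one of the systems named above. The table name `readings`, its columns, and the rows are illustrative assumptions.

```python
import sqlite3

# An in-memory database keeps the example self-contained; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, ts TEXT, value REAL)")

# Illustrative records: (sensor id, ISO 8601 timestamp, measured value).
rows = [
    ("s1", "2024-01-01T00:00:00", 20.5),
    ("s1", "2024-01-01T01:00:00", 21.0),
    ("s2", "2024-01-01T00:00:00", 18.0),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
conn.commit()

# Structured storage lets the database do the aggregation before visualisation.
per_sensor = dict(
    conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor")
)
conn.close()
print(per_sensor)  # {'s1': 20.75, 's2': 18.0}
```

With a plain file repository the same aggregation would require loading and parsing every file first, which is part of the transfer "cost" to weigh when choosing a storage technology.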
2 Data collection and annotation
Curation/Expert annotation
- typically, the data quality is higher
- little training and explanation of the task is needed
- expensive and requires more time
- very small pool of experts available to perform the annotation
Crowdsourced/non-expert annotation
- a large pool of participants is available
- cheap to annotate a large amount of data
- lower quality
- requires detailed training and more annotators
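One common reason for using more annotators is that their labels can be combined to offset individual errors. The following is a minimal sketch assuming simple majority voting, which is only one of several aggregation schemes; the function name `majority_vote`, the item identifiers, and the labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return (label, agreement), where agreement is the share of annotators who chose it."""
    if not labels:
        raise ValueError("need at least one annotation")
    counts = Counter(labels)
    # On a tie, most_common keeps the label that was encountered first.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# Illustrative annotations: three annotators per item.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
consensus = {item: majority_vote(labels) for item, labels in annotations.items()}

for item, (label, agreement) in consensus.items():
    print(f"{item}: {label} (agreement {agreement:.2f})")
```

Items with low agreement are natural candidates for review by an expert, which keeps the expensive annotation effort focused on the ambiguous cases.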