Lecture Notes on Data Curation for Interactive Visualisation
Week 1: The relation between data acquisition, curation, processing, and visualisation
Lecture 1: Introduction
A recommended framework: Open Visualization Environment
A typical data processing pipeline: hypothesis, data acquisition, feature extraction, classification/regression, presentation/visualisation.
- Data
- Data collection: source, update frequency, quantity, type of data
- Quality: data quality metrics, noise characterisation
- Storage: storage capacity, transfer latency, bandwidth, loading/saving
- Extraction
- Feature encoding/representation
- Normalisation
- Augmentation/imputation
- Model (not the focus of this course)
- Model Type: regression, classification, multi-label, multi-task, zero-shot/few-shot
- Training protocol
- Objective function
- Visualisation
- Feature decoding
- Presentation type
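As an illustration of the Extraction steps in the outline above, the following is a minimal sketch of mean imputation followed by z-score normalisation. The function names `impute_mean` and `z_score` and the sample values are hypothetical, and mean imputation is only one of several possible strategies; the notes do not prescribe a specific method.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column with one missing reading.
raw = [2.0, None, 4.0, 6.0]
filled = impute_mean(raw)      # the gap is filled with the mean of 2, 4 and 6
normalised = z_score(filled)   # zero mean, unit standard deviation
```

Imputing before normalising keeps the filled value at the centre of the rescaled feature, which is usually the intent of mean imputation.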
Lecture 2: Data quality and curation
1 Data life cycle
Hypothesis → Collection → Analysis → Storage → Dissemination (publication/visualization) → Archive or Destroy → Hypothesis
Data acquisition checklist
- relevant data that could answer the hypothesis
- a storage model based on the storage requirements and transfer “cost”
- estimated cost (technological, effort and economic)
- a sufficient amount of data
- generate synthetic data?
- accuracy and verification
- manual annotation
- how long the data is preserved, and the update and versioning policy
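The accuracy and verification item of the checklist can be partly automated. The sketch below computes two simple quality indicators, completeness and duplicate rate, over a list of records. The function names, field names and records are hypothetical; real projects would choose metrics suited to their own data.

```python
def completeness(records, required_fields):
    """Fraction of records in which every required field is present and not None."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    return complete / len(records)


def duplicate_rate(records, key_fields):
    """Fraction of records whose key repeats that of an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


# Hypothetical acquired records: one missing value and one repeated id.
records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 2, "temp": 21.0},
    {"id": 3, "temp": 19.8},
]
print(completeness(records, ["id", "temp"]))
print(duplicate_rate(records, ["id"]))
```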
Storage requirements
Storage format
- Files? Records? Unstructured vs structured data
- Collect an early sample and extrapolate to the whole set
- types of access (privacy, license, authorisation, authentication)
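The idea of collecting an early sample and extrapolating to the whole set can be sketched as follows. This assumes records are stored as JSON text and that the sample is representative of the full set; `estimate_total_bytes` is a hypothetical helper, and a different storage format would need a different size measure.

```python
import json


def estimate_total_bytes(sample_records, expected_count):
    """Extrapolate total storage from the mean serialised size of a sample."""
    if not sample_records:
        raise ValueError("need at least one sample record")
    # Size of each record when serialised as UTF-8 encoded JSON.
    sizes = [len(json.dumps(r).encode("utf-8")) for r in sample_records]
    mean_size = sum(sizes) / len(sizes)
    return mean_size * expected_count


# Hypothetical early sample and an assumed final record count.
sample = [
    {"sensor": "s1", "value": 20.5},
    {"sensor": "s2", "value": 19.75},
]
estimate = estimate_total_bytes(sample, expected_count=1_000_000)
print(f"approx. {estimate / 1e6:.1f} MB before indexes and compression")
```

The estimate ignores index overhead, compression and replication, so it is best read as an order-of-magnitude figure.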
Sampling frequency
- the frequency of sampling the data
- update frequency of the source data
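The two frequencies above interact: sampling faster than the source updates yields repeated values. A minimal sketch, assuming the source value changes at every update (which may not hold in practice) and using the hypothetical helpers `redundant_fraction` and `drop_unchanged`:

```python
def redundant_fraction(sampling_hz, update_hz):
    """Expected fraction of samples that repeat an unchanged source value."""
    if sampling_hz <= 0 or update_hz <= 0:
        raise ValueError("frequencies must be positive")
    if sampling_hz <= update_hz:
        # Sampling no faster than the source: no repeated readings expected.
        return 0.0
    return 1 - update_hz / sampling_hz


def drop_unchanged(samples):
    """Keep a (timestamp, value) pair only when the value differs from the last kept one.

    Note: this also drops genuine updates that happen to carry the same value.
    """
    kept = []
    for timestamp, value in samples:
        if not kept or kept[-1][1] != value:
            kept.append((timestamp, value))
    return kept


# Hypothetical readings sampled twice per source update.
samples = [(0, 1.0), (1, 1.0), (2, 2.0), (3, 2.0), (4, 1.0)]
compact = drop_unchanged(samples)
```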
Storage technology
- File repository (network, local file systems)
- SQL databases (e.g. Postgres, MySQL), column databases (e.g. InfluxDB), NoSQL databases (e.g. MongoDB), search indexes (e.g. Elasticsearch)
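To make the SQL option concrete, here is a minimal sketch using an in-memory SQLite database from the Python standard library. SQLite is used only as a convenient stand-in and is not one of the systems listed above; the `readings` table, its columns and the rows are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "source TEXT NOT NULL, recorded_at TEXT NOT NULL, value REAL)"
)

# Hypothetical structured records: one row per reading.
rows = [
    ("sensor-a", "2024-01-01T00:00:00", 20.5),
    ("sensor-a", "2024-01-01T01:00:00", 21.0),
    ("sensor-b", "2024-01-01T00:00:00", 18.2),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
conn.commit()

# Aggregate per source, as a visualisation front end might request.
per_source = conn.execute(
    "SELECT source, COUNT(*), AVG(value) FROM readings "
    "GROUP BY source ORDER BY source"
).fetchall()
conn.close()
```

A fixed schema like this suits structured records; unstructured data would point towards a file repository or a NoSQL store instead.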
2 Data collection and annotation
Annotation
Curation/Expert annotation
- Typically, the data quality is higher
- little training and explanation of the task is needed
- expensive and requires more time
- very small pool of experts available to perform the annotation
Crowdsourcing
- a large pool of participants is available
- cheap to annotate large amounts of data
- lower annotation quality
- requires detailed training and more annotators
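One common way to use more annotators to offset the lower quality of crowdsourced labels is majority voting. The sketch below is one possible aggregation scheme, not a method prescribed by these notes; `majority_vote`, the item ids and the labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning label, share of annotators who chose it).

    Ties are broken by the order in which labels were first seen.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# Hypothetical crowdsourced annotations: three annotators per item.
annotations = {
    "img-1": ["cat", "cat", "dog"],
    "img-2": ["dog", "dog", "dog"],
    "img-3": ["cat", "dog", "bird"],
}
aggregated = {item: majority_vote(labels) for item, labels in annotations.items()}

# Items without a strict majority could be passed to an expert annotator.
needs_review = [item for item, (_, share) in aggregated.items() if share <= 0.5]
```

Routing only the disputed items to experts combines the low cost of crowdsourcing with the higher quality of expert annotation.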