Lecture Notes on Data Curation for Interactive Visualisation

2022-09-14
2 minute read

Week 1: The relation between data acquisition, curation, processing, and visualisation

Lecture 1: Introduction

A recommended framework: Open Visualization Environment

A typical data processing pipeline: hypothesis, data acquisition, feature extraction, classification/regression, presentation/visualisation (sketched in code after the list below).

  • Data
    • Data collection: source, update frequency, quantity, type of data
    • Quality: data quality metrics, noise characterisation
    • Storage: storage capacity, transfer latency, bandwidth, loading/saving
  • Extraction
    • Feature encoding/representation
    • Normalisation
    • Augmentation/imputation
  • Model (not the focus of this course)
    • Model Type: regression, classification, multi-label, multi-task, zero-shot/few-shot
    • Training protocol
    • Objective function
  • Visualisation
    • Feature decoding
    • Presentation type
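
As a concrete illustration, here is a minimal Python sketch of these stages wired together. All function names and the toy data are invented for illustration; they are not part of the Open Visualization Environment or any prescribed API.

```python
# A minimal sketch of the pipeline stages above. All names are
# illustrative, not part of any prescribed API.
import numpy as np

def acquire_data(n_samples: int = 100) -> np.ndarray:
    """Data stage: here we simply simulate acquisition with random noise."""
    rng = np.random.default_rng(seed=0)
    return rng.normal(loc=0.0, scale=1.0, size=(n_samples, 3))

def extract_features(raw: np.ndarray) -> np.ndarray:
    """Extraction stage: z-score normalisation per feature column."""
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)

def fit_model(features: np.ndarray) -> np.ndarray:
    """Model stage (not the focus of this course): a stand-in that
    labels samples by the sign of their first feature."""
    return (features[:, 0] > 0).astype(int)

def visualise(labels: np.ndarray) -> None:
    """Visualisation stage: decode the result into a presentation.
    Here we just print per-class counts instead of plotting."""
    for label in np.unique(labels):
        print(f"class {label}: {np.sum(labels == label)} samples")

if __name__ == "__main__":
    raw = acquire_data()
    features = extract_features(raw)
    labels = fit_model(features)
    visualise(labels)
```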

Lecture 2: Data quality and curation

1 Data life cycle

Hypothesis → Collection → Analysis → Storage → Dissemination (publication/visualization) → Archive or Destroy → Hypothesis

Data acquisition checklist
  • identify relevant data that could answer the hypothesis
  • choose a storage model based on the storage requirements and transfer “cost”
  • estimate the costs (technological, effort, and economic)
  • ensure a sufficient amount of data
  • decide whether to generate synthetic data (see the sketch after this list)
  • check accuracy and verification of the data
  • plan for manual annotation
  • define the preservation period, update, and versioning policy
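
Where real data are scarce, synthetic samples can supplement them. Below is a minimal sketch of one simple approach, fitting an independent normal distribution per column of a hypothetical numeric table; the dataset and figures are invented for illustration.

```python
# A minimal sketch of synthesising extra tabular samples: fit a
# per-column normal distribution to the real data and draw from it.
import numpy as np

def synthesise(real: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Draw n_new synthetic rows matching each column's mean and std.
    This ignores correlations between columns; more faithful generators
    are needed when column dependence matters."""
    rng = np.random.default_rng(seed)
    mu = real.mean(axis=0)
    sigma = real.std(axis=0)
    return rng.normal(loc=mu, scale=sigma, size=(n_new, real.shape[1]))

real = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])
augmented = np.vstack([real, synthesise(real, n_new=5)])
print(augmented.shape)  # (8, 2): 3 real rows plus 5 synthetic rows
```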

Storage requirements
Storage format
  • Files? Records? Unstructured vs structured data
  • Collect an early sample and extrapolate for the whole set (see the estimate sketch below)
  • types of access (privacy, license, authorisation, authentication)
Sampling frequency
  • the frequency of sampling the data
  • update frequency of the source data
Storage technology
  • File repository (network, local file systems)
  • SQL databases (e.g. Postgres, MySQL), column databases (e.g. InfluxDB), NoSQL databases (e.g. MongoDB), search indexes (e.g. Elasticsearch)
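
The "collect early and extrapolate" item above amounts to simple arithmetic: measure the on-disk size of an early sample and scale it up to the expected total. A minimal sketch, with invented placeholder figures:

```python
# Estimate total storage from an early sample. All numbers below are
# invented placeholders; substitute measured values from your project.
sample_records = 1_000            # records collected so far
sample_bytes = 250 * 1024 * 1024  # measured size of that sample: 250 MiB
expected_records = 2_000_000      # records expected over the project

bytes_per_record = sample_bytes / sample_records
estimated_total = bytes_per_record * expected_records
print(f"~{estimated_total / 1024**3:.1f} GiB expected")  # ~488.3 GiB
```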

2 Data collection and annotation

Annotation
Curation/Expert annotation
  • Typically, the data quality is higher
  • requires little training and explanation of the task
  • expensive and time-consuming
  • only a very small pool of experts is available to perform the task
Crowdsourcing
  • a large pool of participants is available
  • cheap to annotate large amounts of data
  • typically lower data quality
  • requires detailed training and more annotators, whose labels are then aggregated (as sketched below)
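
One common way to trade annotator count for quality is to collect redundant labels per item and aggregate them by majority vote. A minimal sketch, with invented example labels:

```python
# Aggregate redundant crowdsourced labels by majority vote.
# The items and labels below are invented for illustration.
from collections import Counter

# Each item was shown to three annotators.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],  # no clear majority
}

for item, labels in annotations.items():
    label, votes = Counter(labels).most_common(1)[0]
    # Flag items without a strict majority for expert review.
    status = label if votes > len(labels) / 2 else "needs expert review"
    print(item, "->", status)
```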

3 Data quality metrics