
Introduction to Data Distribution

Course Introduction

Course Overview

  • Data Distribution & Big Data Processing

Harnessing the complexity of large amounts of data is a challenge in itself.

But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure), which will be explored in this module.

Objectives

By the end of this module, participants will be able to:

  • Understand the differences between the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU) and when to use each
  • Implement the distribution of simple operations via the Map/Reduce principle in PySpark (see the sketch after this list)
  • Understand the principles of Kubernetes
  • Deploy a Big Data processing platform on the Cloud
  • Implement the distribution of data wrangling/cleaning and of machine learning training using the PyData stack, Jupyter notebooks and Dask
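As a flavour of the second objective, here is a minimal sketch of the Map/Reduce principle with PySpark's RDD API: a word count where the map steps emit (word, 1) pairs and the reduce step sums them per key. The application name and sample lines are illustrative, not part of the course material.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Illustrative in-memory input; in practice this would come from distributed storage.
lines = sc.parallelize(["big data needs distribution", "data distribution at scale"])

counts = (
    lines.flatMap(lambda line: line.split())  # map: split each line into words
         .map(lambda word: (word, 1))         # map: emit a (word, 1) pair per word
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word across partitions
)

print(counts.collect())
spark.stop()
```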

Program

Deployment & Intro to Orchestration (6h)

Big Data & Distributed Computing (3h)

Spark (3.5h)

Kubernetes & Dask (3.5h)

Evaluation (7h)

  • Prerequisite: a Pangeo platform deployed beforehand
  • Clean large amounts of data using Dask in the cloud (3h)
  • Train machine learning models in parallel (hyperparameter search) (3h), as sketched below
  • Notebook with code cells to fill in and questions to answer
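To illustrate what "training models in parallel" can look like, here is a minimal sketch, assuming scikit-learn and dask.distributed are available, of running a hyperparameter grid search on a Dask cluster by pointing scikit-learn's joblib machinery at Dask. The dataset, estimator and parameter grid are placeholders chosen for brevity; the evaluation notebook defines its own.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Local cluster here; during the evaluation this would target the cloud deployment.
client = Client()

X, y = load_digits(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.001, 0.01, 0.1]}  # illustrative grid
search = GridSearchCV(SVC(), param_grid, cv=3)

# Each cross-validated fit is dispatched as a task to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
client.close()
```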

Evaluation introduction slides