Introduction to Data Distribution
Course Overview
- Data Distribution & Big Data Processing
Harnessing the complexity of large amounts of data is a challenge in itself.
But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure), which will be explored in this module.
Objectives
By the end of this module, participants will be able to:
- Understand the differences between the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU) and when to use each
- Implement the distribution of simple operations via the Map/Reduce principle in PySpark (a minimal sketch follows this list)
- Understand the principle of Kubernetes
- Deploy a Big Data Processing Platform on the Cloud
- Distribute data wrangling/cleaning and the training of machine learning algorithms using the PyData stack, Jupyter notebooks and Dask
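To give a first taste of the Map/Reduce objective above, here is a minimal, hedged PySpark sketch (a classic word count); it assumes a local Spark installation, and `input.txt` is a hypothetical sample file:

```python
# Minimal word-count sketch using PySpark's RDD map/reduce API.
# Assumes a local Spark installation; "input.txt" is a hypothetical sample file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                      # distributed collection of lines
counts = (
    lines.flatMap(lambda line: line.split())          # map: one record per word
         .map(lambda word: (word, 1))                 # map: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)             # reduce: sum counts per word
)
print(counts.take(10))                                # first few (word, count) pairs

spark.stop()
```

The `flatMap`/`map`/`reduceByKey` chain is the Map/Reduce pattern referred to in the objective.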
Program
Deployment & Intro to Orchestration (6h)
Big Data & Distributed Computing (3h)
- Introduction to Big Data and its ecosystem (1h)
- What is Big Data?
- Legacy "Big Data" ecosystem
- Big Data use cases
- Big Data to Machine Learning
- Big Data platforms, Hadoop & Beyond (2h)
- Hadoop, HDFS and MapReduce
- Datalakes, Data Pipelines
- From HPC to Big Data to Cloud and High Performance Data Analytics
- BI vs Big Data
- Hadoop legacy: Spark, Dask, Object Storage ... (see the sketch below)
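To make the last item above more concrete, here is a minimal sketch of Dask reading data directly from object storage; the bucket, path and column names are hypothetical placeholders, and it assumes `dask[dataframe]` and `s3fs` are installed:

```python
# Sketch: reading Parquet files from object storage with Dask and aggregating lazily.
# The bucket, path and column names below are hypothetical placeholders.
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://example-bucket/events/*.parquet",   # hypothetical object-storage path
    storage_options={"anon": True},           # public-bucket access, as an example
)
daily_counts = df.groupby("event_date")["event_id"].count()  # builds a lazy task graph
print(daily_counts.compute())                 # triggers (possibly distributed) execution
```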
Spark (3.5h)
Kubernetes & Dask (3.5h)
- Containers Orchestration (1h)
- Kubernetes & CaaS & PaaS (Databricks, Coiled)
- Play with Kubernetes (if we have time)
- Dask Presentation (1h)
- Deploy a data processing platform on the Cloud based on Kubernetes and Dask (1.5h) (a deployment sketch follows this list)
- Exercise: DaskHub or Dask Kubernetes or Pangeo
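As a preview of that deployment exercise, one possible approach uses the `dask-kubernetes` operator. This is a sketch under assumptions: it presumes `kubectl` access to a cluster with the Dask operator installed, the exact `KubeCluster` signature depends on the installed `dask-kubernetes` version, and the cluster name, image and worker count are placeholders:

```python
# Sketch: creating a Dask cluster on Kubernetes with dask-kubernetes (operator API),
# then connecting a client to it. Assumes kubectl access to a cluster and that the
# Dask operator is installed; the name, image and worker count are placeholders.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(
    name="training-cluster",            # hypothetical cluster name
    image="ghcr.io/dask/dask:latest",   # example Dask image
    n_workers=3,
)
client = Client(cluster)                # attach a Dask client to the cluster
print(client.dashboard_link)            # dashboard URL for monitoring tasks

# ... run Dask computations here ...

client.close()
cluster.close()
```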
Evaluation (7h)
- Prerequisite: a Pangeo platform deployed beforehand
- Clean large amounts of data using Dask in the cloud (3h)
- Train machine learning models in parallel (hyper-parameter search) (3h) (see the sketch after this list)
- Notebook with code cells to fill in and questions to answer
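For orientation on the hyper-parameter search part, a minimal sketch could combine scikit-learn's `GridSearchCV` with the Dask joblib backend; the estimator, parameter grid and toy data below are illustrative assumptions, and in the actual evaluation the client would connect to the cloud-deployed cluster:

```python
# Sketch: running a scikit-learn grid search with its candidates/folds dispatched
# to Dask workers via the joblib "dask" backend. The data and parameter grid are
# toy placeholders; in the evaluation the Client would point at the cloud cluster.
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()                       # local cluster here; cloud cluster in practice
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=3,
    n_jobs=-1,
)

with parallel_backend("dask"):          # route joblib tasks to Dask workers
    grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
client.close()
```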