Introduction to Data Distribution
Course Overview
- Data Distribution & Big Data Processing
Harnessing the complexity of large amounts of data is a challenge in itself.
But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure), which will be explored in this module.
Objectives
By the end of this module, participants will be able to:
- Understand the differences between the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU) and when to use each
- Implement the distribution of simple operations via the Map/Reduce principle in PySpark (a minimal sketch follows this list)
- Understand the principle of Kubernetes
- Deploy a Big Data Processing Platform on the Cloud
- Distribute data wrangling/cleaning and the training of machine learning algorithms using the PyData stack, Jupyter notebooks and Dask
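To give a first taste of the Map/Reduce objective above, here is a minimal, hedged PySpark sketch (a classic word count); it assumes a local Spark installation, and `input.txt` is a hypothetical sample file:

```python
# Minimal word-count sketch using PySpark's RDD map/reduce API.
# Assumes a local Spark installation; "input.txt" is a hypothetical sample file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                      # distributed collection of lines
counts = (
    lines.flatMap(lambda line: line.split())          # map: one record per word
         .map(lambda word: (word, 1))                 # map: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)             # reduce: sum counts per word
)
print(counts.take(10))                                # first few (word, count) pairs

spark.stop()
```

The `flatMap`/`map`/`reduceByKey` chain is the Map/Reduce pattern referred to in the objective.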
Program
Deployment & Intro to Orchestration (6h)
Big Data & Distributed Computing (3h)
- Introduction to Big Data and its ecosystem (1h)
- What is Big Data?
- Legacy "Big Data" ecosystem
- Big Data use cases
- Big Data to Machine Learning
- Big Data platforms, Hadoop & Beyond (2h)
- Hadoop, HDFS and MapReduce
- Datalakes, Data Pipelines
- From HPC to Big Data to Cloud and High Performance Data Analytics
- BI vs Big Data
- Hadoop legacy: Spark, Dask, Object Storage ... (see the sketch below)
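To make the last item above more concrete, here is a minimal sketch of Dask reading data directly from object storage; the bucket, path and column names are hypothetical placeholders, and it assumes `dask[dataframe]` and `s3fs` are installed:

```python
# Sketch: reading Parquet files from object storage with Dask and aggregating lazily.
# The bucket, path and column names below are hypothetical placeholders.
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://example-bucket/events/*.parquet",   # hypothetical object-storage path
    storage_options={"anon": True},           # public-bucket access, as an example
)
daily_counts = df.groupby("event_date")["event_id"].count()  # builds a lazy task graph
print(daily_counts.compute())                 # triggers (possibly distributed) execution
```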
Spark (3.5h)
Kubernetes & Dask (3.5h)
- Containers Orchestration (1h)
- Kubernetes & CaaS & PaaS (Databricks, Coiled)
- Play with Kubernetes (if we have time)
- Dask Presentation (1h)
- Deploy a data processing platform on the Cloud based on Kubernetes and Dask (1.5h) (a deployment sketch follows this list)
- Exercise: DaskHub or Dask Kubernetes or Pangeo
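As a preview of that deployment exercise, one possible approach uses the `dask-kubernetes` operator. This is a sketch under assumptions: it presumes `kubectl` access to a cluster with the Dask operator installed, the exact `KubeCluster` signature depends on the installed `dask-kubernetes` version, and the cluster name, image and worker count are placeholders:

```python
# Sketch: creating a Dask cluster on Kubernetes with dask-kubernetes (operator API),
# then connecting a client to it. Assumes kubectl access to a cluster and that the
# Dask operator is installed; the name, image and worker count are placeholders.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(
    name="training-cluster",            # hypothetical cluster name
    image="ghcr.io/dask/dask:latest",   # example Dask image
    n_workers=3,
)
client = Client(cluster)                # attach a Dask client to the cluster
print(client.dashboard_link)            # dashboard URL for monitoring tasks

# ... run Dask computations here ...

client.close()
cluster.close()
```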
Evaluation (7h)
- Prerequisite: a Pangeo platform deployed beforehand
- Clean large amounts of data using Dask in the cloud (3h)
- Train machine learning models in parallel (hyper-parameter search) (3h) (see the sketch after this list)
- Notebook with code cells to fill in and questions to answer
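For orientation on the hyper-parameter search part, a minimal sketch could combine scikit-learn's `GridSearchCV` with the Dask joblib backend; the estimator, parameter grid and toy data below are illustrative assumptions, and in the actual evaluation the client would connect to the cloud-deployed cluster:

```python
# Sketch: running a scikit-learn grid search with its candidates/folds dispatched
# to Dask workers via the joblib "dask" backend. The data and parameter grid are
# toy placeholders; in the evaluation the Client would point at the cloud cluster.
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()                       # local cluster here; cloud cluster in practice
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=3,
    n_jobs=-1,
)

with parallel_backend("dask"):          # route joblib tasks to Dask workers
    grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
client.close()
```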