Spark and Hadoop

Course overview:

The workshop covers the basic concepts of Hadoop, mostly within the Cloudera stack: using HBase and Impala to query data, and using Spark to stream data. We will launch a Cloudera QuickStart container, work with datasets of top-rated movies, analyze and query the data with Hadoop, and explain and demonstrate MapReduce concepts and RDD partitioning in Spark.


The main goal is to understand what big data is, how to ingest data, the main concepts of a Hadoop data warehouse, and how to use Spark to process and stream big data.

Target audience

Entry-level big data practitioners, DBAs, and BI engineers with familiarity with open-source systems.

Technical requirements

  • Installations:
    • Docker installed on Linux: sudo apt-get install
    • Download the Cloudera QuickStart image: docker pull cloudera/quickstart:latest
    • Start the Cloudera stack container:

    docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888 -p 80 -p 7180 -d <Name of the Image> /usr/bin/docker-quickstart

Duration: 1 day


  • Part 1: Introduction to Hadoop and MapReduce:
    • Hadoop Distributors
    • Hadoop vs Traditional Data Storage
    • Working with HDFS
    • Basic commands
    • Architecture
  • Part 2: Hive and HBase:
    • HiveQL
    • Hive Data types
    • HBase data model
    • HBase vs RDBMS
    • Client API and REST
  • Part 3: Apache Spark (PySpark):
    • Basics and RDD
    • Caching & Modules
    • Spark Streaming
    • Spark SQL
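
As a taste of the MapReduce ideas in Part 1, the map, shuffle, and reduce phases can be sketched in plain Python without a cluster. This is only an illustrative sketch: the RATINGS sample and the function names (map_phase, shuffle_phase, reduce_phase, average_ratings) are hypothetical and are not taken from the workshop material or from the Hadoop API. It computes the average rating per movie, similar in spirit to the top-rated movies exercises.

```python
from collections import defaultdict

# Hypothetical sample of (movie_title, rating) records; not the workshop dataset.
RATINGS = [
    ("Movie A", 5.0),
    ("Movie B", 3.0),
    ("Movie A", 4.0),
    ("Movie C", 2.0),
    ("Movie B", 4.0),
]


def map_phase(records):
    """Map: emit one (key, (rating, 1)) pair per input record."""
    for title, rating in records:
        yield title, (rating, 1)


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum ratings and counts per key, then compute the average."""
    averages = {}
    for key, values in grouped.items():
        total = sum(rating for rating, _ in values)
        count = sum(n for _, n in values)
        averages[key] = total / count
    return averages


def average_ratings(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    for title, avg in sorted(average_ratings(RATINGS).items()):
        print(title, avg)
```

On a real cluster the framework runs the map and reduce steps in parallel across nodes and performs the shuffle over the network; here everything runs in a single process so the data flow is easy to follow.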