Chapter 6

Big Data Technologies

Data Science and Analytics · BCA · Updated Apr 23, 2026

Big data refers to datasets that are too large, arrive too fast, or are too complex for traditional data-processing tools to handle. Big data technologies make it possible to store, process, and analyze such datasets.

5 Vs

Volume (scale, from terabytes to petabytes), Velocity (speed of arrival, including real-time streams), Variety (structured, semi-structured, unstructured), Veracity (data quality and trustworthiness), Value (useful insights extracted from the data).

Hadoop

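Open-source framework for distributed storage and processing on clusters of commodity machines. HDFS: distributed file system that replicates data blocks across nodes (three copies by default). MapReduce: parallel processing model (Map: turn input records into key-value pairs; Reduce: aggregate the values for each key). YARN: cluster resource management and job scheduling. Ecosystem: Hive (SQL-like queries), Pig (dataflow scripting), HBase (NoSQL column-family store).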
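
The Map and Reduce steps can be illustrated with a word count in plain Python. This is a minimal single-process sketch, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the shuffle step that Hadoop performs across the cluster is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group values by key (done by the framework between phases)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big insights", "data lakes store raw data"]
    print(word_count(sample))
```

On a real cluster, many map tasks run in parallel on different HDFS blocks and many reduce tasks each handle a subset of the keys; the logic per record stays the same.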

Apache Spark

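In-memory processing engine, often cited as 10-100x faster than MapReduce for workloads that fit in memory (such as iterative algorithms); the gain is smaller for disk-bound jobs. Components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing). Supports Python (PySpark), Scala, Java, R.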

Data Lakes

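Store raw data in its native format until it is needed. Schema-on-read: structure is applied when the data is queried (a data warehouse uses schema-on-write, enforcing structure at load time). Storage technologies: HDFS, Amazon S3, Azure Data Lake Storage. Table formats such as Delta Lake and Apache Iceberg add ACID transactions on top of that storage.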
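
Schema-on-read can be sketched in a few lines of plain Python. The example assumes raw events stored as one JSON document per line; the sample records, the SCHEMA mapping, and the read_with_schema helper are hypothetical and do not belong to any data lake product.

```python
import json

# Raw events as they might land in a data lake: one JSON document per line,
# with inconsistent fields and types. Nothing is validated at write time.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2026-01-05"}',
    '{"user": "b2", "amount": 5, "channel": "web"}',
    '{"user": "c3"}',
]

# Hypothetical schema chosen by the reader at query time:
# field name -> (type to cast to, default when the field is missing).
SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(lines, schema):
    """Parse raw lines and project and cast each record to the given schema."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, SCHEMA))
total = sum(row["amount"] for row in rows)
```

A different reader could apply a different schema to the same raw lines, for example one that keeps "channel" and ignores "amount". With schema-on-write, the second and third records would have been rejected or reshaped when they were loaded.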

Stream Processing

Kafka: distributed event streaming platform (publish-subscribe). Spark Streaming: micro-batch processing, handling the stream as a series of small batches. Apache Flink: record-at-a-time stream processing with exactly-once state semantics. Use cases: fraud detection, live dashboards, IoT.
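
The difference between record-at-a-time and micro-batch processing can be shown with plain Python generators. This is an illustrative sketch only: it uses neither Spark nor Flink, the function names and payment values are invented, and batches are cut by event count, whereas real micro-batch engines typically cut them by time interval.

```python
from typing import Iterable, Iterator


def per_event(events: Iterable[float], threshold: float) -> Iterator[float]:
    """Record-at-a-time: each event is checked as soon as it arrives."""
    for amount in events:
        if amount > threshold:
            yield amount


def micro_batches(events: Iterable[float], batch_size: int) -> Iterator[list[float]]:
    """Micro-batch: buffer events and hand them on in small groups."""
    batch: list[float] = []
    for amount in events:
        batch.append(amount)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


payments = [12.0, 950.0, 30.0, 40.0, 1200.0]

# Fraud-style check: flag each large payment immediately.
flagged = list(per_event(payments, threshold=500.0))

# Dashboard-style aggregate: one total per micro-batch of two events.
batch_totals = [sum(batch) for batch in micro_batches(payments, batch_size=2)]
```

The per-event version can react to the 950.0 payment before the next event arrives, while the micro-batch version only produces a result once a batch is complete, which adds latency in exchange for simpler batch-style processing.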

NoSQL Databases

Document (MongoDB), key-value (Redis, DynamoDB), column-family (Cassandra, HBase), graph (Neo4j). Choose based on data model and access patterns.
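
To make the four data models concrete, the sketch below represents one order in each of them using ordinary Python structures. It is a rough analogy under simplifying assumptions, not client code for any of the named databases; every identifier and sample value is hypothetical.

```python
import json

# Document model (e.g. MongoDB): one nested, self-describing record.
order_doc = {
    "_id": "order:1",
    "customer": {"id": "u1", "name": "Asha"},
    "items": [{"sku": "p9", "qty": 2}],
}

# Key-value model (e.g. Redis): an opaque value fetched by its exact key.
kv_store = {"order:1": json.dumps(order_doc)}

# Column-family model (e.g. Cassandra): row key -> column family -> columns.
wide_rows = {"u1": {"profile": {"name": "Asha"}, "orders": {"order:1": "p9x2"}}}

# Graph model (e.g. Neo4j): relationships stored as first-class edges.
edges = [("u1", "PLACED", "order:1"), ("order:1", "CONTAINS", "p9")]


def neighbours(node: str, relation: str) -> list[str]:
    """Follow edges of one relation type outward from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


# Each model favours a different access pattern.
name_from_doc = order_doc["customer"]["name"]  # nested field lookup
name_from_kv = json.loads(kv_store["order:1"])["customer"]["name"]  # get by key, then decode
orders_for_user = list(wide_rows["u1"]["orders"])  # all columns of one row
products_for_user = [
    sku
    for order in neighbours("u1", "PLACED")
    for sku in neighbours(order, "CONTAINS")
]  # two-hop traversal
```

If most queries fetch a whole order by its ID, the document or key-value shape fits well; if they ask which products a user is connected to through several hops, the graph shape is the more natural choice.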

Summary

Big data technologies (Hadoop, Spark, data lakes, stream processing, and NoSQL databases) enable the processing of massive datasets for modern data-intensive applications.
