Big Data Technologies

Big data refers to datasets that are too large, arrive too fast, or are too complex for traditional single-machine tools. Big data technologies store, process, and analyze such datasets by distributing the work across clusters of machines.

5 Vs

Volume (TB to PB), Velocity (real-time streaming), Variety (structured, semi-structured, unstructured), Veracity (quality), Value (extracting insights).

Hadoop

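Open-source framework for distributed storage and processing on clusters of commodity hardware. HDFS: distributed file system that replicates data blocks across nodes (default replication factor 3). MapReduce: parallel batch processing (Map: turn input records into key-value pairs; Reduce: aggregate the values for each key). YARN: cluster resource management and job scheduling. Ecosystem: Hive (SQL queries), Pig (dataflow scripting), HBase (NoSQL column-family store).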
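
The Map and Reduce phases can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the shuffle step stands in for the grouping that the framework performs between the two phases on a real cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate the list of values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big tools", "data lakes"]))
```

On a cluster, each map call would run on the node holding that block of input, and each reduce call would receive all values for its key from every mapper.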

Apache Spark

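In-memory processing, often cited as 10-100x faster than MapReduce; the largest gains come on iterative and interactive workloads whose data fits in memory. Components: Spark SQL, Spark Streaming (succeeded by Structured Streaming), MLlib (machine learning), GraphX (graph processing). Supports Python (PySpark), Scala, Java, R.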

Data Lakes

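Store raw data in its native format until needed. Schema-on-read: structure is applied when the data is queried (vs. warehouse schema-on-write, where data must fit a schema before it is loaded). Technologies: HDFS, Amazon S3, Azure Data Lake Storage. Table formats such as Delta Lake and Apache Iceberg add ACID transactions on top of lake storage.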
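
The difference between schema-on-read and schema-on-write can be sketched in plain Python. This is an illustration under assumptions, not how any particular lake or warehouse engine works: RAW_EVENTS, read_with_schema, and write_with_schema are hypothetical names, and the records are made-up sample data.

```python
import json

# Raw records kept exactly as they arrived; fields differ between records.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "b2", "amount": "5", "note": "gift"}',
]


def read_with_schema(raw_lines, schema):
    """Schema-on-read: select and cast fields at query time; missing fields become None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows


def write_with_schema(record, schema):
    """Schema-on-write: validate before storing; reject records that do not fit."""
    missing = [field for field in schema if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {field: cast(record[field]) for field, cast in schema.items()}


if __name__ == "__main__":
    # Two consumers read the same raw data through different schemas.
    print(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
    print(read_with_schema(RAW_EVENTS, {"user": str, "ts": str}))
```

The raw data is stored once and each reader decides which fields it needs, whereas write_with_schema would refuse the second record for any schema that requires ts.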

Stream Processing

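Kafka: distributed event streaming platform (publish-subscribe over durable, partitioned logs). Spark Streaming: micro-batch (the stream is processed as a series of small batches). Apache Flink: true (event-at-a-time) streaming with exactly-once state consistency. Use cases: fraud detection, live dashboards, IoT.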
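
The contrast between micro-batch and true streaming can be sketched without any streaming engine. The code below is a plain-Python illustration, not the Spark or Flink API: EVENTS is made-up data, and micro_batches, batch_totals, and per_event_alerts are hypothetical names. It ignores everything a real engine handles, such as out-of-order events, fault tolerance, and delivery guarantees.

```python
from collections import defaultdict

# (timestamp_seconds, account, amount): made-up sample events, in arrival order.
EVENTS = [
    (0, "acct-1", 40.0),
    (1, "acct-1", 70.0),
    (2, "acct-2", 10.0),
    (5, "acct-1", 30.0),
    (6, "acct-2", 20.0),
]


def micro_batches(events, interval):
    """Group events into fixed-length intervals; yield (batch_start, batch) in time order."""
    batches = defaultdict(list)
    for ts, account, amount in events:
        batches[(ts // interval) * interval].append((ts, account, amount))
    for start in sorted(batches):
        yield start, batches[start]


def batch_totals(events, interval):
    """Micro-batch style: one result per interval, available only when the batch closes."""
    result = {}
    for start, batch in micro_batches(events, interval):
        totals = defaultdict(float)
        for _, account, amount in batch:
            totals[account] += amount
        result[start] = dict(totals)
    return result


def per_event_alerts(events, limit):
    """Event-at-a-time style: update running state and react as each event arrives."""
    running = defaultdict(float)
    alerts = []
    for ts, account, amount in events:
        running[account] += amount
        if running[account] > limit:
            alerts.append((ts, account))
    return alerts


if __name__ == "__main__":
    print(batch_totals(EVENTS, interval=5))
    print(per_event_alerts(EVENTS, limit=100.0))
```

In the micro-batch version, a suspicious total for acct-1 becomes visible only after the first 5-second batch closes; in the per-event version, the alert fires at the event that crosses the limit. That latency difference is the main reason fraud detection is a common use case for event-at-a-time engines.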

NoSQL Databases

Document (MongoDB), key-value (Redis, DynamoDB), column-family (Cassandra, HBase), graph (Neo4j). Choose based on data model and access patterns.
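
To make the four data models concrete, the sketch below represents similar shop data in each shape using plain Python structures. It is an illustration only and uses no database client: all names and records (document_store, key_value_store, column_family_store, edges, also_bought) are hypothetical.

```python
# Document model: one nested, self-contained record per order.
document_store = {
    "order-1": {"customer": "ana", "items": [{"sku": "kb-01", "qty": 2}]},
}

# Key-value model: opaque values, looked up only by exact key.
key_value_store = {"session:ana": "cart=kb-01x2"}

# Column-family model: row key -> column family -> columns.
column_family_store = {
    "ana": {
        "profile": {"city": "Lisbon"},
        "orders": {"order-1": "kb-01x2"},
    },
}

# Graph model: nodes connected by typed edges; queries walk relationships.
edges = [
    ("ana", "BOUGHT", "kb-01"),
    ("ben", "BOUGHT", "kb-01"),
    ("ben", "BOUGHT", "ms-02"),
]


def also_bought(customer):
    """Items bought by customers who share at least one purchase with `customer`."""
    mine = {item for who, rel, item in edges if who == customer and rel == "BOUGHT"}
    peers = {who for who, rel, item in edges if item in mine and who != customer}
    theirs = {item for who, rel, item in edges if who in peers and rel == "BOUGHT"}
    return sorted(theirs - mine)


if __name__ == "__main__":
    print(document_store["order-1"]["items"][0]["sku"])  # whole order in one read
    print(key_value_store["session:ana"])                # exact-key lookup
    print(column_family_store["ana"]["orders"])          # one family of one row
    print(also_bought("ana"))                            # relationship traversal
```

Each access pattern is cheap in the model built for it and awkward in the others: the recommendation query is a short traversal over edges but would require scanning every order in the document layout.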

Summary

Big data technologies (Hadoop, Spark, data lakes, stream processing, NoSQL) make it possible to store and process massive datasets for modern data-intensive applications.
