Big Data Technologies
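Big data refers to datasets too large, too fast-moving, or too complex for traditional single-machine tools such as a conventional relational database. Big data technologies distribute the storage, processing, and analysis of these datasets across clusters of machines.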
5 Vs
Volume (TB to PB), Velocity (real-time streaming), Variety (structured, semi-structured, unstructured), Veracity (quality), Value (extracting insights).
Hadoop
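Open-source framework for distributed storage and processing on clusters of commodity machines. HDFS: distributed file system that splits files into blocks and replicates each block across nodes (3 copies by default). MapReduce: parallel batch processing (Map: turn input records into key-value pairs; Reduce: aggregate the values for each key). YARN: cluster resource management and job scheduling. Ecosystem: Hive (SQL queries), Pig (dataflow scripting), HBase (NoSQL column-family store).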
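A minimal single-process sketch of the Map and Reduce phases, using the classic word-count example. This is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the shuffle step stands in for the grouping the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit one (word, 1) pair for every word in the input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: aggregate all values that share one key.
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big tools", "data lakes"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes holding each input block, and the reduce calls run in parallel per key range; the sketch only shows the data flow.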
Apache Spark
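In-memory processing; often cited as 10-100x faster than MapReduce, mainly for iterative workloads that fit in memory, with smaller gains for disk-bound jobs. Components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing). Supports Python (PySpark), Scala, Java, R.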
Data Lakes
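Store raw data in its native format until needed. Schema-on-read: structure is applied when the data is queried (vs warehouse schema-on-write, where it is enforced at load time). Storage: HDFS, Amazon S3, Azure Data Lake Storage. Table formats such as Delta Lake and Apache Iceberg add ACID transactions on top of this storage.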
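A small sketch of schema-on-read, assuming raw JSON lines as the stored format. The raw records are kept untouched, and a schema is applied only when they are read. The sample events, the SCHEMA mapping, and read_with_schema are all hypothetical names chosen for illustration, not part of any data lake product.

```python
import json

# Raw events as they might land in the lake: inconsistent types, optional fields.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "b2", "amount": 5, "country": "DE"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(raw_lines, schema):
    # Apply the schema while reading; the stored raw lines are never modified.
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, SCHEMA)
for row in rows:
    print(row)
```

A different consumer could read the same raw lines with a different schema (for example, one that keeps ts), which is the flexibility schema-on-write gives up in exchange for validation at load time.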
Stream Processing
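Kafka: distributed event streaming platform (publish-subscribe over a durable, partitioned log). Spark Streaming: micro-batch (events are processed in small, frequent batches). Apache Flink: event-at-a-time streaming with exactly-once state consistency via checkpointing. Use cases: fraud detection, live dashboards, IoT.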
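A toy sketch of the difference between per-event and micro-batch processing, using a fraud-style check on payment amounts. It is plain Python and does not use Kafka, Spark, or Flink. The function names, the 200 threshold, and the count-based batch size are assumptions for illustration; real micro-batch engines typically cut batches by time interval rather than by count.

```python
def flag_large(batch):
    # Hypothetical rule: flag any amount above 200 as suspicious.
    return [amount for amount in batch if amount > 200]


def per_event(events, handle):
    # Per-event style: each event is handled as soon as it arrives.
    return [handle([event]) for event in events]


def micro_batch(events, batch_size, handle):
    # Micro-batch style: buffer events, then handle each full buffer at once.
    results = []
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            results.append(handle(batch))
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        results.append(handle(batch))
    return results


amounts = [10, 250, 30, 900, 5]
streamed = per_event(amounts, flag_large)
batched = micro_batch(amounts, 2, flag_large)
print(streamed)
print(batched)
```

Both styles flag the same amounts; the difference is latency, since a micro-batch result is available only once its batch closes.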
NoSQL Databases
Document (MongoDB), key-value (Redis, DynamoDB), column-family (Cassandra, HBase), graph (Neo4j). Choose based on data model and access patterns.
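To make the four data models concrete, here is one user with one order represented in each shape, using plain Python structures as stand-ins. None of this is a real database client; the keys, labels, and the orders_for_user helper are invented for illustration.

```python
# Document model: one nested record per user, orders embedded inside it.
document = {"_id": "u1", "name": "Ada", "orders": [{"id": "o1", "total": 40}]}

# Key-value model: flat keys, opaque values, lookup by exact key only.
key_value = {"user:u1:name": "Ada", "order:o1:total": 40}

# Column-family model: row key -> column family -> column -> value.
column_family = {
    "u1": {
        "profile": {"name": "Ada"},
        "orders": {"o1:total": 40},
    }
}

# Graph model: nodes with properties, plus typed edges between them.
nodes = {
    "u1": {"label": "User", "name": "Ada"},
    "o1": {"label": "Order", "total": 40},
}
edges = [("u1", "PLACED", "o1")]


def orders_for_user(user_id):
    # Graph-style access: follow PLACED edges out of the user node.
    return [dst for src, rel, dst in edges if src == user_id and rel == "PLACED"]


print(document["orders"][0]["total"])
print(key_value["order:o1:total"])
print(column_family["u1"]["orders"]["o1:total"])
print(nodes[orders_for_user("u1")[0]]["total"])
```

The same value is reachable in every model, but the access path differs: one document fetch, one exact-key lookup, one row plus column read, or an edge traversal. That difference is what "access patterns" refers to when choosing a store.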
Summary
Big data technologies (Hadoop, Spark, data lakes, stream processing, NoSQL databases) enable the processing of massive datasets for modern data-intensive applications.