Big Data is a term used to describe collections of data that are very large and that grow exponentially over time.
The data is so large and complex that none of the traditional data management tools can store or process it efficiently.
Note that any data that can be stored, accessed, and processed in a fixed format is called 'structured' data.
Because this data is managed at large scale, solutions must be implemented that can handle, store, and analyze large amounts of data in a short time.
Looking at the figures handled at this scale, one can easily understand why the name 'Big Data' is given and imagine the challenges of storing and processing it.
That is why today we are going to learn about some popular open source tools that can be used to create a data analysis platform.
Apache Hadoop is an open source software platform that processes very large data sets in a distributed environment.
It provides storage and computational power, mainly on low-cost commodity hardware.
Apache Hadoop is designed to easily scale from a few to thousands of servers.
It lets you process locally stored data in parallel across the machines of a cluster.
One of the benefits of Hadoop is that it handles failure at the software level. Apache Hadoop provides a framework for the file system layer, the cluster management layer, and the processing layer.
It also leaves room for other projects and frameworks to work with the Hadoop ecosystem and to provide their own implementation of any of these layers.
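To make the idea of parallel processing over locally stored data more concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop itself: the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a thread pool stands in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks, workers=4):
    # Each chunk stands in for a data block stored on a different node;
    # the thread pool is only a stand-in for the cluster's workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


blocks = ["big data is big", "data grows over time", "big clusters process data"]
counts = word_count(blocks)
print(counts["big"], counts["data"])  # prints: 3 3
```

The point of the sketch is that each chunk is mapped independently, so the map step can run wherever the chunk is stored, and only the grouped intermediate results need to be brought together.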
Elasticsearch is a full-text search and analytics engine. It is a highly scalable, distributed system designed to work quickly and efficiently with big data, and one of its main use cases is log analysis.
It is capable of advanced and complex searches and near real-time processing for advanced analysis and operational intelligence.
Elasticsearch is written in Java and is based on Apache Lucene. It stores data as schema-free JSON documents, which makes it easy to adopt.
It is one of the leading enterprise-grade search engines. You can write your client in any programming language; official clients exist for Java, .NET, PHP, Python, Perl, and others.
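Full-text search engines built on Lucene rely on an inverted index, which maps each term to the documents that contain it. The following is a toy sketch of that idea over schema-free, JSON-like documents. It is not the Elasticsearch API: the TinyIndex class and its index and search methods are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index over schema-free documents (not the Elasticsearch API)."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def index(self, doc_id, doc):
        """Store a document and index every string field, whatever its name."""
        self.docs[doc_id] = doc
        for value in doc.values():
            if isinstance(value, str):
                for term in re.findall(r"\w+", value.lower()):
                    self.postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(matches)


logs = TinyIndex()
# The documents do not share a fixed set of fields.
logs.index(1, {"level": "error", "message": "Disk full on node-7"})
logs.index(2, {"level": "info", "message": "Node-7 restarted", "uptime_s": 12})
logs.index(3, {"service": "auth", "message": "Login error for user bob"})

print(logs.search("error"))       # prints: [1, 3]
print(logs.search("disk error"))  # prints: [1]
```

A real engine adds relevance scoring, analyzers, and distribution across shards; the sketch only shows why lookups by term stay fast as the number of documents grows.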
MongoDB is a NoSQL database based on the document data model. In MongoDB everything is a collection or document.
In MongoDB terminology, a collection is roughly equivalent to a table, while a document is roughly equivalent to a row.
MongoDB is an open source, document-oriented, cross-platform database. It is written mainly in C++.
It is also one of the leading NoSQL databases, offering high performance, high availability, and easy scalability.
MongoDB uses JSON-like documents with optional schemas and provides rich query support. Some of its main features include indexing, replication, load balancing, aggregation, and file storage.
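The collection and document vocabulary can be illustrated with plain Python data structures. This sketch does not use MongoDB or its drivers: the find helper below is a hypothetical function that only loosely mirrors an equality filter, and the sample data is made up for the example.

```python
def find(collection, query):
    """Return the documents whose fields equal every key/value pair in query."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


# A "collection" is modeled as a list, and each "document" as a dict.
# Unlike rows in a table, documents need not share the same fields.
users = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Linus", "city": "Helsinki", "languages": ["C"]},
    {"_id": 3, "name": "Grace", "city": "London", "rank": "Rear Admiral"},
]

london = find(users, {"city": "London"})
print([doc["name"] for doc in london])  # prints: ['Ada', 'Grace']
```

In a real deployment, an index on the queried field would avoid scanning every document, which is one reason indexing appears in the feature list above.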
Cassandra is an open source Apache NoSQL database management system.
Cassandra's rows are organized in tables and indexed by a key. It uses an append-only, log-structured storage engine.
Data in Cassandra is distributed across multiple peer nodes with no single master, so there is no single point of failure. It is a top-level Apache project, and its development is overseen by the Apache Software Foundation (ASF).
Cassandra is designed to solve the problems associated with operating at large (web) scale.
Given Cassandra's masterless architecture, it can continue to operate despite a small (though significant) number of hardware failures. Cassandra runs on multiple nodes in multiple data centers.
It replicates data across these data centers to avoid failure or downtime. This makes it a highly fault-tolerant system.
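The way key-indexed data can be spread over equal peers and still survive node failures can be sketched with a small hash ring. This is a simplified illustration, not Cassandra's implementation: it assumes one token per node, uses SHA-256 instead of Cassandra's default Murmur3 partitioner, ignores data centers, and the Ring class with its replicas and readable methods is a hypothetical name.

```python
import hashlib
from bisect import bisect_right


class Ring:
    """Simplified token ring with replication (an illustration, not Cassandra code)."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = min(replication_factor, len(nodes))
        # Every node gets a position (token) on the ring; no node is special.
        self.tokens = sorted((self._token(node), node) for node in nodes)

    @staticmethod
    def _token(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        """Return the first node clockwise from the key's token, plus the next rf - 1 nodes."""
        positions = [token for token, _ in self.tokens]
        start = bisect_right(positions, self._token(key))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(self.rf)]

    def readable(self, key, down=()):
        """A key stays readable while at least one of its replicas is up."""
        return any(node not in down for node in self.replicas(key))


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], replication_factor=3)
owners = ring.replicas("user:42")
print(len(owners))                                   # prints: 3
print(ring.readable("user:42", down=owners[:2]))     # prints: True
print(ring.readable("user:42", down=owners))         # prints: False
```

With a replication factor of 3, the key remains available after two of its three replicas fail, which is the property the paragraph above describes as fault tolerance.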