Posts

Showing posts from October, 2020

NOSQL - MongoDB

MongoDB is one of the most popular NoSQL databases. It stores data as documents made up of key-value pairs, and it supports MapReduce-style processing over those documents. > db.places.insert(place1) Production
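To make the document model more concrete, here is a minimal Python sketch of what a document such as `place1` might look like as nested key-value pairs. The field names and values are hypothetical (the post excerpt does not show the contents of `place1`), and a plain Python list stands in for the `places` collection instead of a real MongoDB server.

```python
import json

# Hypothetical document: field names and values are illustrative only.
place1 = {
    "name": "Example Cafe",
    "city": "Pune",
    "tags": ["coffee", "wifi"],
    "location": {"lat": 18.52, "lon": 73.86},
}

# A plain list stands in for the `places` collection (no database involved).
places = []
places.append(place1)


def find_by(collection, key, value):
    """Return every document whose top-level `key` equals `value`."""
    return [doc for doc in collection if doc.get(key) == value]


matches = find_by(places, "city", "Pune")
print(json.dumps(matches[0], indent=2))
```

Each document carries its own keys, so two documents in the same collection do not need to share the same fields.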

Apache Spark

Apache Spark is a framework for real-time data analytics in a distributed computing environment. Spark is written in Scala and was originally developed at the University of California, Berkeley. It performs computations in memory to process data faster than MapReduce. It is often cited as being up to 100x faster than Hadoop MapReduce for large-scale data processing, thanks to in-memory computation and other optimizations. As a result, it needs more processing power and memory than MapReduce. Spark comes packed with high-level libraries and supports R, SQL, Python, Scala, Java, etc. These standard libraries make it easier to integrate Spark into complex workflows. On top of this, it integrates with components such as MLlib, GraphX, SQL + DataFrames, and Streaming to extend its capabilities. This is a very common question in everyone’s mind: “Apache Spark: A Killer or Saviour of Apache Hadoop?” – O’Reilly The Answer t...

What is Hadoop

Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license. It was developed based on the paper Google published on the MapReduce system, and it applies concepts from functional programming. Hadoop is written in Java and ranks among the top-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella. It all started when the two were building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realized that their architec...
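The MapReduce idea that Hadoop builds on can be illustrated with a small word count. This is a single-process Python sketch of the map, shuffle, and reduce steps, not Hadoop's actual API. The function names and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: fold the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Illustrative input; on a real cluster each line could live on a different node.
lines = ["big data on hadoop", "hadoop stores big data"]
counts = word_count(lines)
print(counts)
```

Because the map step handles each line independently and the reduce step handles each key independently, both steps can be spread across many machines, which is the property a distributed framework relies on.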

What is BIG DATA

As the name suggests, Big Data is the huge volume of data generated by IoT devices, apps, and real-time data collection.
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• 230+ million tweets are created every day.
• More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
The three formats of big data are:
1. Structured: organised data with a fixed schema. Ex: RDBMS
2. Semi-structured: partially organised data that does not have a fixed format. Ex: XML, JSON
3. Unstructured: unorganised data with an unknown schema. Ex: audio, video files, etc.
Main challenges
• Validity: correctness of data
• Variability: dynamic behaviour
• Volatility: tendency to change over time
• Vulnerability: exposure to breaches or attacks
• Visualization: presenting the data in a meaningful way
Solutions - Hadoop, Pig, Hive, Cassandra, Spark, Kafka e...
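The difference between structured and semi-structured data can be seen in a short Python sketch. The CSV and JSON samples below are made up for illustration. The CSV rows all share one fixed set of columns, while the JSON records each describe their own fields.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (like an RDBMS table).
csv_text = "id,name,amount\n1,alice,20\n2,bob,35\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records are self-describing, and their fields may vary.
json_text = '[{"id": 1, "name": "alice"}, {"id": 2, "tags": ["new"]}]'
records = json.loads(json_text)

# Collect the distinct "shapes" (sets of field names) seen in each dataset.
structured_shapes = {frozenset(row.keys()) for row in rows}
semi_shapes = {frozenset(rec.keys()) for rec in records}

print(len(structured_shapes))  # number of distinct shapes among the CSV rows
print(len(semi_shapes))        # number of distinct shapes among the JSON records
```

Unstructured data such as audio or video has no field names at all, so it cannot be inspected this way without first extracting features from it.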