Apache Spark
Apache Spark is a framework for real-time data analytics in a distributed computing environment. Spark is written in Scala and was originally developed at the University of California, Berkeley. It performs computations in memory to process data faster than MapReduce. By exploiting in-memory computation and other optimizations, it can run up to 100x faster than Hadoop MapReduce for large-scale data processing, depending on the workload. The trade-off is that it requires more memory and processing power than MapReduce. Spark comes packed with high-level libraries and supports R, SQL, Python, Scala, Java, and other languages. These standard libraries make it easier to integrate Spark into complex workflows. On top of this, it integrates with services such as MLlib, GraphX, SQL + DataFrames, and Streaming to extend its capabilities. A very common question on everyone’s mind is: “Apache Spark: A Killer or Saviour of Apache Hadoop?” (O’Reilly) The Answer t...
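To make the in-memory point above concrete, here is a minimal plain-Python sketch (not Spark code). It runs the same two-stage job twice: once handing the intermediate result over through a file on disk, roughly the way chained MapReduce jobs do, and once keeping it in memory. The function names and the `stage1.json` file name are assumptions made up for this illustration.

```python
import json
import os
import tempfile


def run_with_disk_handoff(numbers, workdir):
    """Two stages; the intermediate result is written to disk and read back."""
    disk_ops = 0
    path = os.path.join(workdir, "stage1.json")  # hypothetical file name

    # Stage 1: square every value, then persist the result.
    with open(path, "w") as f:
        json.dump([n * n for n in numbers], f)
    disk_ops += 1

    # Stage 2: reload the intermediate result and sum it.
    with open(path) as f:
        squared = json.load(f)
    disk_ops += 1
    return sum(squared), disk_ops


def run_in_memory(numbers):
    """Same two stages; the intermediate result never leaves memory."""
    squared = [n * n for n in numbers]  # stage 1
    return sum(squared), 0              # stage 2, no disk operations


if __name__ == "__main__":
    data = list(range(1, 6))
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk_handoff(data, workdir))  # (55, 2)
    print(run_in_memory(data))                       # (55, 0)
```

Both versions return the same sum; only the number of disk operations differs. On real clusters that disk (and network) traffic between stages is a large part of the cost that in-memory processing avoids.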
Hadoop is an open-source software framework for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. It is licensed under the Apache License 2.0. Hadoop was developed based on Google's paper on the MapReduce system, and it applies concepts from functional programming. It is written in Java and is a top-level Apache project. Hadoop was created by Doug Cutting and Michael J. Cafarella. It all started when the two were building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realized that their architec...
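The MapReduce model mentioned above borrows the map and reduce ideas from functional programming. As a rough sketch only, here is a single-process word count in plain Python that mimics the three phases (map, shuffle, reduce). It does not use Hadoop's API, and all function names are assumptions chosen for this illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Fold each key's list of values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


def word_count(documents):
    """Run map over every document, then shuffle and reduce the pairs."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    docs = ["big data on Hadoop", "Hadoop stores big data"]
    print(word_count(docs))
```

In a real cluster the map calls run in parallel on different machines, and the shuffle step moves data across the network so that all values for one key reach the same reducer.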
As the name suggests, Big Data is the huge volume of data generated by IoT devices, apps, and real-time application data collection.
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• More than 230 million tweets are created every day.
• More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
The three different formats of big data are:
1. Structured: Organised data format with a fixed schema. Ex: RDBMS
2. Semi-Structured: Partially organised data which does not have a fixed format. Ex: XML, JSON
3. Unstructured: Unorganised data with an unknown schema. Ex: Audio, video files etc.
Main Challenges
• Validity: correctness of data
• Variability: dynamic behaviour
• Volatility: tendency to change over time
• Vulnerability: exposure to breaches or attacks
• Visualization: presenting the data in a meaningful way
Solutions: Hadoop, Pig, Hive, Cassandra, Spark, Kafka e...
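The three data formats listed above can be told apart by how much a program can assume before reading them. The following plain-Python sketch is only an illustration; the helper names and the sample records are invented for this example and do not come from any particular tool.

```python
import csv
import io
import json


def read_structured(text):
    """Structured: every row follows the fixed schema given by the header."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Semi-structured: one JSON object per line; keys may vary per record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def describe_unstructured(blob):
    """Unstructured: no schema is known, so only the raw size is reported."""
    return {"size_bytes": len(blob)}


if __name__ == "__main__":
    table = "id,name\n1,Ann\n2,Bob\n"
    records = '{"id": 1}\n{"id": 2, "tags": ["x"]}\n'
    audio = b"\x00\x01\x02\x03"  # stand-in for the bytes of an audio file

    print(read_structured(table))
    print(read_semi_structured(records))
    print(describe_unstructured(audio))
```

Note that the CSV reader returns every field as a string, while the JSON records keep their own types and may carry extra keys such as `tags`. The raw bytes need a dedicated decoder before anything meaningful can be extracted from them.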