2. What is Big Data?
• Big Data refers to the large volumes of data pouring in from various sources in different formats (structured, semi-structured and unstructured).
• Because of the volume and varied nature of this data, traditional relational database systems are poorly suited to handling it.
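The three formats named above can be told apart with a few lines of Python. This is only an illustrative sketch: the sample records, field names and variable names are made up for the example.

```python
import csv
import io
import json

# Structured: fixed columns, so every record fits the same table layout.
structured = io.StringIO("id,name,amount\n1,Asha,250\n2,Ravi,120\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing records whose fields can differ.
semi_structured = json.loads(
    '[{"id": 1, "tags": ["new"]}, {"id": 2, "address": {"city": "Pune"}}]'
)

# Unstructured: free text with no predefined fields at all.
unstructured = "Customer called on Monday and asked about a refund."

columns = sorted(rows[0].keys())                       # same for every row
keys_per_record = [sorted(r.keys()) for r in semi_structured]
word_count = len(unstructured.split())                 # only crude measures apply
```

Every CSV row shares the same columns, the JSON records do not, and the free text has no fields to query. That variety is what makes a single fixed relational schema awkward.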
3. What are Big Data Technologies & Why
are they needed?
• Big Data technologies can be defined as software utilities designed to analyse, process and extract information from extremely large and complex data sets that traditional data processing software cannot handle.
• We need Big Data processing technologies to analyse this huge amount of data, much of it arriving in real time, and to draw conclusions and make predictions that help reduce future risks.
4. Top Big Data Technologies
Top big data technologies are divided into four fields based on their usage:
• Data Storage: Big data storage is infrastructure designed specifically to store, manage and retrieve massive amounts of data. It enables quick processing and retrieval of large quantities of data.
• Data Analytics: Data analytics is the process of inspecting, cleansing, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision making.
• Data Mining: Data mining involves exploring and analysing large amounts of data to find patterns. The goal is typically classification or prediction.
• Data Visualisation: Data visualisation is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand.
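The four analytics steps listed above (inspect, cleanse, transform, model) can be sketched on a tiny data set. The readings below are hypothetical and the "model" is just a mean, so treat this as an illustration of the sequence rather than a real pipeline.

```python
from statistics import mean

# Hypothetical raw readings, as they might arrive from a text feed.
raw = [" 12", "15", "", "n/a", "18 ", "21"]

def is_number(text):
    """True when the trimmed text consists only of digits."""
    return text.strip().isdigit()

# Inspect: find the entries that cannot be used.
bad = [r for r in raw if not is_number(r)]

# Cleanse: drop unusable entries and trim whitespace.
clean = [r.strip() for r in raw if is_number(r)]

# Transform: convert text to integers.
values = [int(v) for v in clean]

# Model: the simplest possible summary of the data.
average = mean(values)
```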
5. Open Source Big Data Technologies for
Storage & Management
• Apache Hadoop:
• The Apache Hadoop software library is a big data framework. Its Hadoop Distributed File System (HDFS) is used for storing data.
• It allows distributed processing of large data sets across clusters of computers.
• Developed by: Apache Software Foundation. First released in 2006; version 1.0 followed in December 2011.
• Written in: Java
• Companies using it: Microsoft, IBM, Intel, MapR, Cloudera, Hortonworks etc.
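The idea of spreading one data set across a cluster can be sketched as splitting a file into fixed-size blocks and assigning each block to several machines. This is a toy model, not Hadoop code: the function names and node names are hypothetical, the block size is tiny for readability, and the round-robin placement is an assumption that ignores HDFS's real rack-aware policy. The default of three replicas does match HDFS.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

data = b"x" * 1000                          # stand-in for a large file
blocks = split_into_blocks(data, 256)       # 256 + 256 + 256 + 232 bytes
nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster
placement = place_replicas(len(blocks), nodes)
```

Because every block lives on three nodes, the file can still be read in full if any single node fails.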
• Cassandra:
• The Apache Cassandra database provides effective management of large amounts of data.
• Supports replication of data across multiple data centers for availability and scalability.
• Offers very good fault tolerance and low latency.
• Developed by: originally Facebook, which open-sourced it in July 2008; now maintained by the Apache Software Foundation.
• Written in: Java
• Companies using it: Netflix, Walmart, Uber, McDonald's etc.
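Replication with fault tolerance can be illustrated with a small hash ring: each key is hashed to a position on the ring and stored on the next few nodes. This is a simplified sketch, not Cassandra's implementation. The `Ring` class and node names are hypothetical, MD5 stands in for Cassandra's partitioner, and data-center-aware placement is left out.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a large integer position on the ring (MD5 as a stand-in)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Nodes sorted by their ring position.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key):
        """Return the nodes that hold copies of `key`, walking clockwise."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

ring = Ring(["dc1-node1", "dc1-node2", "dc2-node1", "dc2-node2"])
replicas = ring.replicas_for("user:42")

# If the first replica goes down, the remaining copies can still serve reads.
survivors = [n for n in replicas if n != replicas[0]]
```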
• MongoDB:
• MongoDB is an open-source, cross-platform, document-oriented NoSQL database with many built-in features.
• Developed by: MongoDB Inc. Released on 11 Feb 2009.
• Written in: C++, Go, JavaScript, Python
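A NoSQL document store keeps records as flexible documents rather than fixed rows. The sketch below imitates that idea with plain Python dictionaries. It does not use the MongoDB driver, and `find` here is a hypothetical helper written for this example that only supports exact-match filters.

```python
# Hypothetical collection: documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Asha", "city": "Pune", "tags": ["prime"]},
    {"_id": 2, "name": "Ravi", "city": "Delhi"},
    {"_id": 3, "name": "Mei", "city": "Pune", "age": 31},
]

def find(docs, query):
    """Return documents whose fields equal every key/value pair in `query`."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]

pune = find(collection, {"city": "Pune"})
pune_31 = find(collection, {"city": "Pune", "age": 31})
```

Only the third document has an `age` field, yet all three sit in the same collection. No schema change was needed to add it.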
• Apache HBase:
• Apache HBase is a popular and highly efficient column-oriented NoSQL database built on top of HDFS. It allows real-time read/write operations on large datasets stored as key/value data.
• Developed by: Apache Software Foundation. Released on 28 March 2008.
• Written in: Java
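The key/value, column-oriented layout can be pictured as a map from row key to "family:qualifier" columns, with rows kept sorted by key so that range scans are cheap. The `ToyTable` class below is a hypothetical in-memory sketch of that layout, not the HBase client API.

```python
import bisect

class ToyTable:
    def __init__(self):
        self.row_keys = []   # kept sorted, mirroring HBase's ordering by row key
        self.rows = {}       # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self.rows:
            bisect.insort(self.row_keys, row_key)
            self.rows[row_key] = {}
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

    def scan(self, prefix):
        """Return row keys starting with `prefix`, in sorted order."""
        start = bisect.bisect_left(self.row_keys, prefix)
        result = []
        for key in self.row_keys[start:]:
            if not key.startswith(prefix):
                break
            result.append(key)
        return result

table = ToyTable()
table.put("user#002", "info:name", "Ravi")
table.put("user#001", "info:name", "Asha")
table.put("user#001", "stats:logins", 7)
table.put("order#900", "info:total", 250)
```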
7. Open Source Big Data Technologies for
Data Analytics
• Apache Spark:
• Open-source big data tool that fills the gaps of Apache Hadoop concerning data processing.
• Spark can handle both batch data and real-time data.
• Because Spark processes data in memory, it is often much faster than traditional disk-based processing.
• Developed by: originally UC Berkeley's AMPLab; now maintained by the Apache Software Foundation
• Written in: Scala (core), Java, Python, R
• Apache Hive:
• It allows programmers to analyse large data sets on Hadoop using a SQL-like query language (HiveQL).
• It helps with querying and managing large datasets quickly.
• Developed by: originally Facebook; now maintained by the Apache Software Foundation. Released on 1 Oct 2010.
• Written in: Java
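Hive exposes data through SQL-like queries, and the flavour of such a query can be shown with Python's built-in `sqlite3` module. SQLite is only a stand-in here: it runs on one machine and is not Hive. The `page_views` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("IN", 120), ("US", 80), ("IN", 30), ("DE", 45)],   # made-up sample rows
)

# An aggregate query of the kind an analyst would write against a large table.
totals = conn.execute(
    "SELECT country, SUM(views) AS total "
    "FROM page_views GROUP BY country ORDER BY total DESC"
).fetchall()
conn.close()
```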
• Hadoop MapReduce:
• Programming model used to process big data stored in the Hadoop Distributed File System (HDFS).
• Facilitates processing by splitting petabytes of data into smaller chunks that are processed in parallel.
• The logic is executed on the server where the data already resides, which avoids moving data across the network and makes the process quicker.
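The model described above is commonly illustrated with a word count: each chunk is mapped to (word, 1) pairs, pairs are grouped by word, and each group is reduced to a total. The sketch below runs in a single process with hypothetical function names. A real Hadoop job would run the map step on the nodes that hold each chunk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data is everywhere"]   # stand-ins for file blocks
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```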
• Apache Kafka:
• Distributed streaming platform.
• It aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
• Developed by: originally LinkedIn, which open-sourced it in 2011; now maintained by the Apache Software Foundation
• Written in: Scala, Java
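A real-time feed of this kind can be pictured as an append-only log in which every message gets an offset and each consumer remembers how far it has read. The `ToyTopic` class below is a hypothetical single-process sketch of that idea. It is not the Kafka client API and leaves out partitions, brokers and persistence.

```python
class ToyTopic:
    def __init__(self):
        self.log = []   # append-only list; the index is the message offset

    def produce(self, message):
        """Append a message and return its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def consume(self, offset, max_messages=10):
        """Read up to max_messages starting at offset; return (batch, next offset)."""
        batch = self.log[offset:offset + max_messages]
        return batch, offset + len(batch)

topic = ToyTopic()
for event in ["click", "view", "purchase"]:
    topic.produce(event)

# Two independent consumers read the same log at their own pace.
batch_a, next_a = topic.consume(0, max_messages=2)
batch_b, next_b = topic.consume(0)
rest_a, next_a = topic.consume(next_a)
```

Reading does not remove messages, so consumer B still sees everything consumer A has already processed.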
9. Open Source Big Data Technologies for
Data Mining
• Presto:
• Open-source distributed SQL query engine for running analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
• Developed by: Facebook, which open-sourced it in 2013.
• Written in: Java
• Elasticsearch:
• Based on the Lucene library.
• It provides a distributed, multitenant-capable, full-text search engine with an HTTP web interface and schema-free JSON documents.
• Developed by: Elastic NV. First released in 2010; the company was founded in 2012.
• Written in: Java
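Full-text search over schema-free JSON documents typically rests on an inverted index, a map from each term to the documents that contain it. The sketch below builds one in plain Python. The documents, the `tokenize` helper and the `search` function are hypothetical, and nothing here calls Elasticsearch or Lucene.

```python
import json
import re
from collections import defaultdict

# Hypothetical documents; note that they do not share a schema.
documents = {
    1: json.loads('{"title": "Intro to Big Data", "views": 10}'),
    2: json.loads('{"title": "Big clusters, small files", "author": "Mei"}'),
    3: json.loads('{"body": "Streaming data with low latency"}'),
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, doc in documents.items():
    for value in doc.values():
        if isinstance(value, str):
            for term in tokenize(value):
                index[term].add(doc_id)

def search(query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)
```

A lookup touches only the index entries for the query terms, which is why search stays fast as the number of documents grows.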
10. Open Source Technologies for Data
Visualisation
• Candela:
• Candela is a data visualisation package made available through the Resonant platform.
• It sets itself apart from other tools by providing a full suite of data visualisation components.
• Charted:
• An open-source tool that automatically visualises data.
• Charted is perhaps one of the easiest data visualisation tools around: it simply requires a link to a .csv file or a Google Sheets location. Hit Go and Charted creates a visual display using a bar or line chart.
• Developed by: the Product Science team at Medium in the year 2013
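The CSV-in, chart-out workflow can be imitated in a few lines of Python that turn a CSV into a text bar chart. This is only a sketch of the idea, not how Charted works internally. The `bar_chart` function and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical CSV content, as it might be fetched from a link.
csv_text = "month,signups\nJan,3\nFeb,7\nMar,5\n"

def bar_chart(text, label_col, value_col):
    """Return one text line per CSV row: a padded label followed by '#' bars."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        lines.append(f"{row[label_col]:<4}{'#' * int(row[value_col])}")
    return lines

chart = bar_chart(csv_text, "month", "signups")
print("\n".join(chart))
```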