Big data and computing grid

Starting our complex problem
Awesome
Distributed file system
Lightning-Fast Cluster Computing

About goal
This workshop will help you understand the compute grid, its model, and its
implementation. We have tried our best to explain the concepts in detail. The
programming language used for the demos is Scala. We then apply this model
to a few problems to show where it is simple and useful.

• Lots of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications. The challenges include capture, curation,
storage, search, sharing, transfer, analysis, and visualization.
• Systems and enterprises generate huge amounts of data, from terabytes up to
petabytes of information.
The NYSE generates about one terabyte of new trade
data per day, which is analyzed to determine
trends for optimal trades.

• 2,500 exabytes of new information in 2013,
with the internet as the primary driver
• The digital universe grew by 62% last year to
800K petabytes and will grow to 1.2
zettabytes this year

Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even
petabytes—of information.
Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big
data must be used as it streams into your enterprise in order to maximize its value.
Big data is any type of data - structured and unstructured data such as text, sensor data,
audio, video, click streams, log files and more. New insights are found when analyzing these
data types together.

Recommendation engines
Ad targeting
Search quality
Abuse and Click fraud detection

Customer churn prevention
Network performance optimization
Call detail record analysis
Analyzing the network to predict failures

Health information exchange
Gene sequencing
Healthcare service quality improvements
Drug safety

Modeling true risk
Threat analysis
Fraud detection
Credit scoring and analysis

Apache Hadoop is a framework that allows for
distributed processing of large data sets across
clusters of commodity computers using a simple
programming model.
It is an open-source data management platform with
scale-out storage and distributed processing.

• Splits a task across processors “near” the data and assembles results
• Self-healing, high-bandwidth clustered storage
• JobTracker manages the TaskTrackers
• Distributed across nodes
• Natively redundant
• NameNode tracks locations.
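
The "split a task, then assemble results" idea above can be sketched with ordinary Scala collections. This is only an in-memory illustration of the map, shuffle, and reduce steps on a single machine, not Hadoop's API; the object and function names (WordCountSketch, mapPhase, shuffle, reducePhase) are hypothetical and chosen for this example.

```scala
object WordCountSketch {
  // Map phase: each input split (here, one line of text) emits (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: group the emitted pairs by key so all counts for a word end up together.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: assemble the result for one key by summing its counts.
  def reducePhase(counts: Seq[Int]): Int = counts.sum

  // The whole job: map every line, shuffle by word, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val mapped = lines.flatMap(mapPhase)
    shuffle(mapped).map { case (word, counts) => (word, reducePhase(counts)) }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("big data and computing grid", "big data big clusters")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

On a real cluster the map calls would run on the nodes that hold each block of the input, and the shuffle would move data over the network; here everything happens in one process so the three phases are easy to see.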

Apache Spark is an open-source cluster-computing
framework for large-scale and near-real-time data processing,
maintained by the Apache Software Foundation.
Spark provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance.
It can run alongside Hadoop and read from HDFS, but it does not
depend on Hadoop MapReduce; it extends the MapReduce model to
efficiently support more types of computations.

Resilient Distributed Datasets (RDD): a collection that can be operated on in parallel.
Transformations: map, flatMap, filter, …
Actions: collect, take, count, …
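
A minimal sketch of how these RDD operations fit together, assuming Spark is on the classpath and run in local mode. The object name RddSketch, the helper evenSquares, and the sample data are hypothetical. map, flatMap, and filter are lazy transformations that only describe a new RDD; count, take, and collect are actions that trigger the computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSketch {
  // Transformations only: nothing is computed when this function returns.
  def evenSquares(sc: SparkContext, numbers: Seq[Int]): RDD[Int] =
    sc.parallelize(numbers)   // distribute a local collection as an RDD
      .map(n => n * n)        // transformation: square each element
      .filter(_ % 2 == 0)     // transformation: keep the even squares

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = evenSquares(sc, 1 to 10)

      // Actions: each of these runs the pipeline defined above.
      println(rdd.count())                    // number of elements
      println(rdd.take(2).mkString(", "))     // first two elements
      println(rdd.collect().mkString(", "))   // all elements, brought to the driver

      // flatMap: one input element can produce several output elements.
      val words = sc.parallelize(Seq("big data", "computing grid")).flatMap(_.split(" "))
      println(words.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

collect brings the whole data set back to the driver, so on large data take or count is usually the safer way to inspect a result.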

[Figure: time to sort 100 TB of data, Hadoop MapReduce vs. Spark]

Most active open source community in big data
200+ developers, 50+ companies contributing
