Scalable Big Data Architecture
Big Data, Big Problem?
PRESENTATION BY :
MOHAMMAD HASAN FARAZMAND
OCTOBER 2016
M.H.FARAZMAND@GMAIL.COM
We Will Review…
 Identifying Big Data Symptoms
 Size Matters
 Typical Business Use Case
 Understanding the Big Data Project’s Ecosystem
 Hadoop Distribution
 Data Acquisition
 Processing Language
 Machine Learning
 NoSQL Stores
 Foundation of a Long-Term Big Data Architecture
 Architecture Overview
 Log Ingestion Application
 Learning Application
 Processing Engine
 Search Engine
This presentation has been prepared based on the first chapter of
Scalable Big Data Architecture
by
Bahaaldine Azarmi
Identifying Big Data Symptoms
 Data management is more complex than it has ever been before!
 Big Data is everywhere, on everyone's mind.
 When should I think about employing Big Data?
 Am I ready?
 What should I start with?
 Different needs:
 The volume of data you handle
 The variety of data structures
 Scalability issues
 Reducing the cost of data processing
Size Matters
 Two main areas: Size + Volume
 Handle new data structures with flexible, schemaless technology
 Big Data is also about extracting added-value information
 Near-real-time processing with a distributed architecture
 Execute complex queries with a NoSQL store
Typical Business Use Case
Analyzing application logs, web access logs, server logs, DB logs, and social networks:
 Customer Behavior Analytics: used on e-commerce websites
 Sentiment Analysis: the image and reputation of a company as perceived across social networks
 CRM Onboarding: combine online data sources with offline data sources for better and more accurate customer segmentation (profile-customized offers)
 Prediction: learning from data, the main Big Data trend of the past two years. For example, in the telecommunications industry:
1) Issue or event prediction based on router logs
2) Product catalog selection
3) Pricing depending on the user's global behavior
Understanding Big Data Project’s Ecosystem
Choosing …
 Hadoop distribution
 Distributed file system
 SQL-like processing language
 Machine learning language
 Scheduler
 Message-oriented middleware
 NoSQL data store
 Data visualization
Hadoop Distribution
Two choices:
 Download each project you need separately
 Use one of the most popular Hadoop distributions
Cloudera CDH
1. Impala: a real-time, parallelized, SQL-based engine that searches for data in HDFS and HBase.
2. Cloudera Manager: Cloudera's console to manage and deploy Hadoop components.
3. Hue: a console for user interaction with data and scripts.
Hortonworks HDP
Hadoop Distributed File System
HDFS
Key features:
 Distribution
 High Availability
 Fault Tolerance
 Tuning
 Security
 Load Balancing
 High Throughput Access
Automatic replication across the cluster data nodes
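As a quick illustration of working with HDFS from client code, here is a minimal sketch using the Python `hdfs` (WebHDFS) package; the NameNode address, port, user, and paths are assumptions to adapt to your cluster.

```python
from hdfs import InsecureClient  # WebHDFS client from the `hdfs` PyPI package

# Assumed NameNode WebHDFS endpoint and user (port 9870 is the Hadoop 3 default).
client = InsecureClient('http://namenode:9870', user='hadoop')

# Write a small text file; HDFS replicates its blocks across the data nodes automatically.
client.write('/demo/hello.txt', data='hello from HDFS\n',
             encoding='utf-8', overwrite=True)

# Read the file back.
with client.read('/demo/hello.txt', encoding='utf-8') as reader:
    print(reader.read())

# List the directory to confirm the file landed.
print(client.list('/demo'))
```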
Data Acquisition
 Large log files, streamed data, ETL processing outcomes, online unstructured data, offline structured data, etc.
Apache Flume
 Reliable, highly available, simple, and flexible, with an intuitive programming model based on streaming data flows.
 Composed of "sources", "channels", and "sinks".
Apache Sqoop
 Transfers bulk data between structured data stores and HDFS (a minimal invocation sketch follows below).
 Imports data from an external relational database into HDFS, HBase, or Hive.
 Exports data from the Hadoop cluster to a relational database or data warehouse.
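Sqoop is driven from the command line, so the sketch below simply shells out to it from Python; the JDBC URL, credentials, table, and target directory are hypothetical, and password handling (e.g. --password-file) is omitted.

```python
import subprocess

# Import one relational table into HDFS with four parallel map tasks.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # hypothetical source database
    "--username", "etl",
    "--table", "orders",                               # table to import
    "--target-dir", "/data/orders",                    # destination directory in HDFS
    "--num-mappers", "4",                              # degree of parallelism
]
subprocess.run(cmd, check=True)
```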
Processing Language
 MapReduce was the main processing framework in the first generation of Hadoop clusters.
 It groups sibling data together (Map) and then aggregates the data according to a specified aggregation operation (Reduce); the plain-Python sketch below illustrates the idea.
 Now that YARN (Yet Another Resource Negotiator) has been implemented, resource management is decoupled from MapReduce, and other processing frameworks can run on the same cluster.
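Hadoop runs the Map and Reduce tasks across the cluster, but the concept itself can be sketched in a few lines of plain Python: map emits key/value pairs, the pairs are grouped by key, and reduce aggregates each group (here, a word count).

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/group: collect all values that share a key...
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # ...then Reduce: apply the aggregation (here, a sum) per key.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big problem", "big data architecture"]
print(reduce_phase(map_phase(lines)))
# {'big': 3, 'data': 2, 'problem': 1, 'architecture': 1}
```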
Batch Processing with Hive
 Hive brings users the simplicity and power of querying data from HDFS in a SQL-like way.
 Hive is not a near-real-time or real-time processing language; it is meant for long-running, low-priority batch processing jobs.
 The main drawback of using a higher-level language rather than native MapReduce is performance.
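A minimal sketch of querying Hive from Python, assuming the PyHive client, a running HiveServer2, and a hypothetical access_logs table stored in HDFS:

```python
from pyhive import hive  # PyHive client for HiveServer2

# Assumed HiveServer2 host, port, and user.
conn = hive.Connection(host='hive-server', port=10000, username='analyst')
cursor = conn.cursor()

# A typical long-running batch query: aggregate web access logs by HTTP status.
cursor.execute("""
    SELECT status, COUNT(*) AS hits
    FROM access_logs
    GROUP BY status
""")
for status, hits in cursor.fetchall():
    print(status, hits)
```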
Stream Processing with Spark Streaming
 An extension of Spark.
 Leverages Spark's distributed data processing framework and treats streaming computation as a series of small micro-batch computations.
 Spark Streaming lets you write a processing job as you would for batch processing, in Java, Scala, or Python.
 The foundation of a strong fault-tolerant and high-performance system.
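A minimal PySpark Streaming sketch: it consumes a hypothetical socket source in 5-second micro-batches and counts words, written exactly as a batch job would be.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="LogStream")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Hypothetical source: a socket emitting one log line per record.
lines = ssc.socketTextStream("localhost", 9999)

# The job reads exactly like a batch job: map, then reduce by key.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```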
Message-Oriented Middleware
with Apache Kafka
 A persistent messaging and high-throughput system.
 Kafka acts as a pivot point in our architecture, mainly to receive data and push it into Spark Streaming.
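A minimal producer sketch using the kafka-python client (the client library, broker address, topic name, and file path are assumptions): each log line is pushed to a topic that Spark Streaming would consume on the other side.

```python
from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker address.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Each web access log line becomes one message on the 'access-logs' topic.
with open("access.log", "rb") as f:
    for line in f:
        producer.send("access-logs", value=line.rstrip())

producer.flush()  # make sure everything is delivered before exiting
```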
Machine Learning
 Spark MLlib enables machine learning on top of Spark.
 Composed of various algorithms, from basic statistics, logistic regression, k-means clustering, and Gaussian mixtures to singular value decomposition and multinomial naive Bayes.
 Train on your data and build prediction models with a few lines of code (see the k-means sketch below).
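A minimal Spark MLlib sketch: train a k-means model on a few toy feature vectors and predict the cluster of a new point; in the real architecture the vectors would come from the processed log data.

```python
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="MLlibKMeans")

# Toy feature vectors standing in for processed log data.
data = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),
    array([9.0, 8.0]), array([8.0, 9.0]),
])

# Train a k-means model with two clusters in a few lines of code.
model = KMeans.train(data, k=2, maxIterations=10)

print(model.clusterCenters)
print(model.predict(array([0.5, 0.5])))  # which cluster does a new point fall into?
```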
NoSQL Stores
 Fundamental pieces of the data architecture.
 Provide scalability and resiliency, and thus high availability.
 Able to ingest very large amounts of data.
Couchbase
 A document-oriented NoSQL database that is easily scalable, provides a flexible model, and delivers consistently high performance.
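A minimal sketch with the Couchbase Python SDK (2.x-style API; the cluster address, bucket name, and document are assumptions), storing and reading back a schemaless JSON user profile:

```python
from couchbase.bucket import Bucket  # Couchbase Python SDK 2.x style; adjust for newer SDKs

# Hypothetical cluster address and bucket name.
bucket = Bucket('couchbase://localhost/profiles')

# Store a flexible, schemaless JSON document keyed by user id.
bucket.upsert('user::42', {
    'name': 'Jane',
    'last_products_viewed': ['tv-4k-55', 'soundbar-300'],
})

# Read it back; the document comes back as a plain Python dict.
print(bucket.get('user::42').value)
```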
ElasticSearch
 A scalable, distributed indexing engine with search features.
 Based on Apache Lucene; enables real-time data analytics and full-text search in your architecture.
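A minimal sketch with the official ElasticSearch Python client (the endpoint, index name, and document are assumptions): index a processed log event, then run a search over it.

```python
from elasticsearch import Elasticsearch  # official Python client

# Hypothetical single-node endpoint; a real deployment would list several nodes.
es = Elasticsearch(["http://localhost:9200"])

# Index a processed log event; it becomes searchable in near real time.
es.index(index="access-logs", body={
    "url": "/checkout",
    "status": 500,
    "timestamp": "2016-10-01T12:00:00",
})

# Structured / full-text search over the indexed events.
result = es.search(index="access-logs",
                   body={"query": {"match": {"status": 500}}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```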
ELK platform
 ElasticSearch is part of the ELK platform: ElasticSearch + Logstash + Kibana.
 Together they provide an end-to-end platform for collecting, storing, and visualizing data.
 Logstash lets you collect data from many kinds of sources.
 ElasticSearch indexes the data in a distributed, scalable, and resilient way.
 Kibana is a customizable user interface in which you can build simple to complex dashboards to explore and visualize the data indexed by ElasticSearch.
Foundation of a Long-Term
Big Data Architecture
Log Ingestion Application
 Consumes application logs such as web access logs.
Learning Application
 Receives a stream of data and builds predictions to optimize our recommendation engine.
Processing Engine
 The heart of the architecture.
Search Engine
 The search engine leverages the data processed by the processing engine and exposes a dedicated RESTful API that will be used for analytic purposes (a minimal endpoint sketch follows below).
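A minimal sketch of such an analytics endpoint, assuming Flask (not named in the deck) in front of the ElasticSearch index used earlier:

```python
from flask import Flask, jsonify          # Flask is an assumption; any web framework would do
from elasticsearch import Elasticsearch

app = Flask(__name__)
es = Elasticsearch(["http://localhost:9200"])

@app.route("/analytics/errors")
def error_count():
    # Analytics clients never query the cluster directly;
    # they only see this dedicated REST endpoint backed by the search engine.
    result = es.search(index="access-logs",
                       body={"query": {"match": {"status": 500}}})
    return jsonify(count=len(result["hits"]["hits"]))

if __name__ == "__main__":
    app.run(port=5000)
```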
Summary
 We have seen all the components that make up our architecture.
Good Luck