Scalable Big Data Architecture
Big Data, Big Problem?
PRESENTATION BY :
MOHAMMAD HASAN FARAZMAND
OCTOBER 2016
M.H.FARAZMAND@GMAIL.COM
We Will Review…
 Identifying Big Data Symptoms
 Size Matters
 Typical Business Use Case
 Understanding the Big Data Project’s Ecosystem
 Hadoop Distribution
 Data Acquisition
 Processing Language
 Machine Learning
 NoSQL Stores
 Foundation of a Long-Term Big Data Architecture
 Architecture Overview
 Log Ingestion Application
 Learning Application
 Processing Engine
 Search Engine
This presentation has been prepared based on the first chapter of
Scalable Big Data Architecture
by
Bahaaldine Azarmi
Identifying Big Data Symptoms
 Data management is more complex than it has ever been before!
 Big Data is everywhere, on everyone's mind.
 When should I think about employing Big Data?
 Am I ready?
 What should I start with?
 Different needs:
 The volume of data you handle
 The variety of data structures
 Scalability issues
 Reducing the cost of data processing
Size Matters
 Two main areas: Size + Volume
 Handle new data structures with flexible, schemaless technology
 Big Data is also about extracting added-value information
 Near-real-time processing with a distributed architecture
 Execute complex queries with a NoSQL store
Typical Business Use Case
Analyzing application logs, web access logs, server logs, DB logs, and social networks:
 Customer Behavior Analytics: used on e-commerce websites
 Sentiment Analysis: the image and reputation of a company as perceived across social networks
 CRM Onboarding: combine online data sources with offline data sources for better and more accurate customer segmentation (profile-customized offers)
 Prediction: learning from data, the main Big Data trend of the past two years. For example, in the telecommunications industry:
1) Issue or event prediction based on router logs
2) Product catalog selection
3) Pricing depending on the user's global behavior
Understanding Big Data Project’s Ecosystem
Choosing …
 Hadoop distribution
 Distributed file system
 SQL-like processing language
 Machine learning language
 Scheduler
 Message-oriented middleware
 NoSQL data store
 Data visualization
Hadoop Distribution
Two choices:
 Download each project you need separately
 Use one of the most popular Hadoop distributions
Cloudera CDH
1. Impala: a real-time, parallelized, SQL-based engine that searches for data in HDFS and HBase.
2. Cloudera Manager: Cloudera's console to manage and deploy Hadoop components.
3. Hue: a console for user interaction with data and scripts.
Hortonworks HDP
Hadoop Distributed File System
HDFS
Key features:
 Distribution
 High Availability
 Fault Tolerance
 Tuning
 Security
 Load Balancing
 High Throughput Access
Automatic replication across the cluster data nodes
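As a quick illustration of working with HDFS from client code, here is a minimal sketch using the Python `hdfs` (WebHDFS) package; the NameNode address, port, user, and paths are assumptions to adapt to your cluster.

```python
from hdfs import InsecureClient  # WebHDFS client from the `hdfs` PyPI package

# Assumed NameNode WebHDFS endpoint and user (port 9870 is the Hadoop 3 default).
client = InsecureClient('http://namenode:9870', user='hadoop')

# Write a small text file; HDFS replicates its blocks across the data nodes automatically.
client.write('/demo/hello.txt', data='hello from HDFS\n',
             encoding='utf-8', overwrite=True)

# Read the file back.
with client.read('/demo/hello.txt', encoding='utf-8') as reader:
    print(reader.read())

# List the directory to confirm the file landed.
print(client.list('/demo'))
```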
Data Acquisition
 Large log files, streamed data, ETL processing outcomes, online unstructured data, offline structured data, etc.
Apache Flume
 Reliable, highly available, simple, and flexible, with an intuitive programming model based on streaming data flows.
 Composed of "sources", "channels", and "sinks".
Apache Sqoop
 Transfers bulk data between structured data stores and HDFS (a minimal invocation sketch follows below).
 Imports data from an external relational database into HDFS, HBase, or Hive.
 Exports data from the Hadoop cluster to a relational database or data warehouse.
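Sqoop is driven from the command line, so the sketch below simply shells out to it from Python; the JDBC URL, credentials, table, and target directory are hypothetical, and password handling (e.g. --password-file) is omitted.

```python
import subprocess

# Import one relational table into HDFS with four parallel map tasks.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # hypothetical source database
    "--username", "etl",
    "--table", "orders",                               # table to import
    "--target-dir", "/data/orders",                    # destination directory in HDFS
    "--num-mappers", "4",                              # degree of parallelism
]
subprocess.run(cmd, check=True)
```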
Processing Language
 MapReduce was the main processing framework in the first generation of Hadoop clusters.
 It groups sibling data together (Map) and then aggregates the data according to a specified aggregation operation (Reduce); the plain-Python sketch below illustrates the idea.
 Now that YARN (Yet Another Resource Negotiator) has been implemented, resource management is decoupled from MapReduce, and other processing frameworks can run on the same cluster.
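Hadoop runs the Map and Reduce tasks across the cluster, but the concept itself can be sketched in a few lines of plain Python: map emits key/value pairs, the pairs are grouped by key, and reduce aggregates each group (here, a word count).

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/group: collect all values that share a key...
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # ...then Reduce: apply the aggregation (here, a sum) per key.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big problem", "big data architecture"]
print(reduce_phase(map_phase(lines)))
# {'big': 3, 'data': 2, 'problem': 1, 'architecture': 1}
```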
Batch Processing with Hive
 Hive brings users the simplicity and power of querying data from HDFS in a SQL-like way.
 Hive is not a near-real-time or real-time processing language; it is meant for long-running, low-priority batch processing jobs.
 The main drawback of using a higher-level language rather than native MapReduce is performance.
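A minimal sketch of querying Hive from Python, assuming the PyHive client, a running HiveServer2, and a hypothetical access_logs table stored in HDFS:

```python
from pyhive import hive  # PyHive client for HiveServer2

# Assumed HiveServer2 host, port, and user.
conn = hive.Connection(host='hive-server', port=10000, username='analyst')
cursor = conn.cursor()

# A typical long-running batch query: aggregate web access logs by HTTP status.
cursor.execute("""
    SELECT status, COUNT(*) AS hits
    FROM access_logs
    GROUP BY status
""")
for status, hits in cursor.fetchall():
    print(status, hits)
```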
Stream Processing with Spark Streaming
 An extension of Spark.
 Leverages Spark's distributed data processing framework and treats streaming computation as a series of small micro-batch computations.
 Spark Streaming lets you write a processing job as you would for batch processing, in Java, Scala, or Python.
 The foundation of a strong fault-tolerant and high-performance system.
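A minimal PySpark Streaming sketch: it consumes a hypothetical socket source in 5-second micro-batches and counts words, written exactly as a batch job would be.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="LogStream")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Hypothetical source: a socket emitting one log line per record.
lines = ssc.socketTextStream("localhost", 9999)

# The job reads exactly like a batch job: map, then reduce by key.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```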
Message-Oriented Middleware
with Apache Kafka
 A persistent messaging and high-throughput system.
 Kafka acts as a pivot point in our architecture, mainly to receive data and push it into Spark Streaming.
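A minimal producer sketch using the kafka-python client (the client library, broker address, topic name, and file path are assumptions): each log line is pushed to a topic that Spark Streaming would consume on the other side.

```python
from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker address.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Each web access log line becomes one message on the 'access-logs' topic.
with open("access.log", "rb") as f:
    for line in f:
        producer.send("access-logs", value=line.rstrip())

producer.flush()  # make sure everything is delivered before exiting
```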
Machine Learning
 Spark MLlib enables machine learning on top of Spark.
 Composed of various algorithms, from basic statistics, logistic regression, k-means clustering, and Gaussian mixtures to singular value decomposition and multinomial naive Bayes.
 Train on your data and build prediction models with a few lines of code (see the k-means sketch below).
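A minimal Spark MLlib sketch: train a k-means model on a few toy feature vectors and predict the cluster of a new point; in the real architecture the vectors would come from the processed log data.

```python
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="MLlibKMeans")

# Toy feature vectors standing in for processed log data.
data = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),
    array([9.0, 8.0]), array([8.0, 9.0]),
])

# Train a k-means model with two clusters in a few lines of code.
model = KMeans.train(data, k=2, maxIterations=10)

print(model.clusterCenters)
print(model.predict(array([0.5, 0.5])))  # which cluster does a new point fall into?
```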
NoSQL Stores
 Fundamental pieces of the data architecture.
 Provide scalability and resiliency, and thus high availability.
 Able to ingest very large amounts of data.
Couchbase
 A document-oriented NoSQL database that is easily scalable, provides a flexible model, and delivers consistently high performance.
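A minimal sketch with the Couchbase Python SDK (2.x-style API; the cluster address, bucket name, and document are assumptions), storing and reading back a schemaless JSON user profile:

```python
from couchbase.bucket import Bucket  # Couchbase Python SDK 2.x style; adjust for newer SDKs

# Hypothetical cluster address and bucket name.
bucket = Bucket('couchbase://localhost/profiles')

# Store a flexible, schemaless JSON document keyed by user id.
bucket.upsert('user::42', {
    'name': 'Jane',
    'last_products_viewed': ['tv-4k-55', 'soundbar-300'],
})

# Read it back; the document comes back as a plain Python dict.
print(bucket.get('user::42').value)
```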
ElasticSearch
 A scalable, distributed indexing engine with search features.
 Based on Apache Lucene; enables real-time data analytics and full-text search in your architecture.
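A minimal sketch with the official ElasticSearch Python client (the endpoint, index name, and document are assumptions): index a processed log event, then run a search over it.

```python
from elasticsearch import Elasticsearch  # official Python client

# Hypothetical single-node endpoint; a real deployment would list several nodes.
es = Elasticsearch(["http://localhost:9200"])

# Index a processed log event; it becomes searchable in near real time.
es.index(index="access-logs", body={
    "url": "/checkout",
    "status": 500,
    "timestamp": "2016-10-01T12:00:00",
})

# Structured / full-text search over the indexed events.
result = es.search(index="access-logs",
                   body={"query": {"match": {"status": 500}}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```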
ELK platform
 ElasticSearch is part of the ELK platform: ElasticSearch + Logstash + Kibana.
 Together they provide an end-to-end platform for collecting, storing, and visualizing data.
 Logstash lets you collect data from many kinds of sources.
 ElasticSearch indexes the data in a distributed, scalable, and resilient way.
 Kibana is a customizable user interface in which you can build simple to complex dashboards to explore and visualize the data indexed by ElasticSearch.
Foundation of a Long-Term
Big Data Architecture
Log Ingestion Application
 Consumes application logs such as web access logs.
Learning Application
 Receives a stream of data and builds predictions to optimize our recommendation engine.
Processing Engine
 The heart of the architecture.
Search Engine
 The search engine leverages the data processed by the processing engine and exposes a dedicated RESTful API that will be used for analytic purposes (a minimal endpoint sketch follows below).
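A minimal sketch of such an analytics endpoint, assuming Flask (not named in the deck) in front of the ElasticSearch index used earlier:

```python
from flask import Flask, jsonify          # Flask is an assumption; any web framework would do
from elasticsearch import Elasticsearch

app = Flask(__name__)
es = Elasticsearch(["http://localhost:9200"])

@app.route("/analytics/errors")
def error_count():
    # Analytics clients never query the cluster directly;
    # they only see this dedicated REST endpoint backed by the search engine.
    result = es.search(index="access-logs",
                       body={"query": {"match": {"status": 500}}})
    return jsonify(count=len(result["hits"]["hits"]))

if __name__ == "__main__":
    app.run(port=5000)
```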
Summary
 We have seen all the components that make up our architecture.
Good Luck