Introduction to
Big Data
AHMED SHOUMAN
Our agenda
 Demystify the term "Big Data"
 Find out what Hadoop is
 Explore the realms of batch and real-time big data processing
 Explore challenges of size, speed and scale in databases
 Skim the surface of big-data technologies
 Provide ways into the big-data world
Big Data
Demystified
What is big data?
 Big data is a collective term for a set of technologies designed
for storage, querying and analysis of extremely large data sets,
sources and volumes.
 Big data technologies come in where traditional off-the-shelf
databases, data warehousing systems and analysis tools fall
short.
How did we end up with so much data?
 Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine
 Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud
 An Important Side Note
Big Data technologies are based on the concept of clustering - Many computers
working in sync to process chunks of our data.
Not just size
 Big data isn't just about data size, but also about data volume,
diversity and inter-connectedness.
Big data is
 Any attribute of our data that challenges either technological capabilities
or business needs, like:
 Scaling, moving, storage and retrieval of ever-growing generated data
 Processing many small data points in real-time
 Analysing diverse semi-structured data from multiple sources
 Querying multiple, diverse data sources in real-time
Breathe... Let's recap
 Lots of data due to technological capabilities and social paradigms
 Not just size! Diversity, volume and inter-connectedness also count
 Scale, speed, processing, querying and analysis
 Challenges technological capabilities or business needs
Hadoop The Elephant in the Room
Everyone talks about Hadoop
 Hadoop is a powerful platform for batch analysis of large
volumes of both structured and unstructured data.
From: Conquering Hadoop with Haskell
Hadoop explained
 Hadoop is a horizontally scalable, fault-tolerant, open-source file system
and batch-analysis platform capable of processing large amounts of data.
 HDFS - Hadoop Distributed File System
 M/R - Hadoop Map-Reduce platform
Hadoop explained
 HDFS is an ever-growing file system. We can store lots and
lots of data on it for later use.
 HDFS is used as the underlying platform for other
technologies like Hadoop M/R, Apache Mahout or HBase.
Hadoop explained
 Imagine we want to look at 30 days' worth of access logs to identify site
usage patterns at a volume of 30M log entries per day.
 Hadoop M/R is a platform that allows us to query HDFS data in parallel for
the purpose of batch (offline) data processing and analysis.
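To make the M/R model concrete, here is a toy sketch in plain Python of the three phases (map, shuffle, reduce) applied to the access-log scenario above. The log format and URLs are made up for illustration; a real Hadoop job would run these phases in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical access-log entries: "<ip> <url>" per record.
logs = [
    "10.0.0.1 /home",
    "10.0.0.2 /pricing",
    "10.0.0.1 /home",
    "10.0.0.3 /docs",
]

# Map phase: emit (key, 1) for each record -- here, one hit per URL.
def map_fn(record):
    _ip, url = record.split()
    yield (url, 1)

# Shuffle phase: group all intermediate values by key.
groups = defaultdict(list)
for record in logs:
    for key, value in map_fn(record):
        groups[key].append(value)

# Reduce phase: aggregate each key's values independently
# (on a real cluster, in parallel across machines).
def reduce_fn(key, values):
    return (key, sum(values))

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'/home': 2, '/pricing': 1, '/docs': 1}
```

The key property is that both the map and the reduce steps work on independent chunks of data, which is what lets Hadoop spread the work over many machines.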
Why is Hadoop so important?
 Scalable and fault-tolerant
 Handles massive amounts of data
 Truly parallel processing
 Data can be semi-structured or unstructured (schemaless)
 Serves as basis for other technologies (HBase, Mahout, Impala, Shark)
Hadoop - Words of caution
 Complex
 Not for real-time
 Choose a distribution (Cloudera, Hortonworks, MapR) for better interoperability
 Requires trained DevOps for day-to-day operations
Breathe...
 We demystified the term Big Data and glimpsed at Hadoop. Now What?
 How do I really get into the Big Data world?
The world of big data
 Batch & Data Science
 DBs
 Real-Time
Batch Processing
Hadoop M/R
Batch processing of large data sets
 We collect data for the purpose of providing end-users with a better
experience in our business domain. This means we have to constantly
query our data and divine new insights and relevant information.
 The problem is that doing this at very large scales is painful and slow.
How do we do this on Hadoop data?
Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Batch processing of large data sets
 Hadoop gives us the basic tools for large data processing in
the form of M/R.
However, Hadoop M/R is pretty annoying to work with
directly, as it lacks many tools relevant to the job (statistical
analysis, machine learning, etc.)
Source: http://xiaochongzhang.me/blog/?p=338
Hadoop querying and data science
tools
 Tool - Purpose
 Hive - Write SQL-like M/R queries on top of Hadoop
 Shark - Hive-compatible, distributed SQL query engine for Hadoop
 Pig - Write scripted M/R queries on top of Hadoop
 Impala - Real-time, SQL-like queries over Hadoop data
 Mahout - Scalable machine learning on top of Hadoop M/R
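To get a feel for what "SQL-like" buys you: a HiveQL aggregation such as `SELECT url, COUNT(*) FROM logs GROUP BY url` (table and column names are made up here) compiles down to an M/R job. The same computation, sketched in plain Python:

```python
from collections import Counter

# Hypothetical "logs" rows: (user, url) tuples.
logs = [("alice", "/home"), ("bob", "/docs"), ("alice", "/docs")]

# Roughly what the GROUP BY / COUNT(*) query computes:
# group rows by url and count each group.
hits_per_url = Counter(url for _user, url in logs)
print(hits_per_url.most_common())  # [('/docs', 2), ('/home', 1)]
```

Hive's value is that you write the one-line query and it generates the map, shuffle and reduce plumbing for you.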
The gentle way in
 Hive or Shark are a great place to start due to their SQL-like nature
 Shark is faster than Hive - less frustration
 You need some Hadoop data to work with (consider Avro)
 Remember - it's SQL-like, not SQL
 Start small, locally and grow to production later
 Check out Apache Sqoop for moving processed Hadoop data to your DB
Databases In the big data world
Databases in the big data world
 The Problem: Traditional RDBMS were not designed for storing, indexing
and querying ever-growing volumes of data.
 The 3S Challenge:
 Size - How much data is written and read
 Speed - How fast can we write and read data
 Scale - How easily can our DB scale to accommodate more data
The 3S Challenge
 There's no single, simple solution to the 3S challenge. Instead,
solutions focus on making an informed sacrifice in one area in
order to gain in another area.
NoSQL and C.A.P.
 NoSQL is a term referring to a family of DBMS that attempt to resolve the
3S challenge by sacrificing one of three areas:
 Consistency - All clients have the same view of data
 Availability - Each client can always read and write
 Partition Tolerance - System works despite physical network failures
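The trade-off shows up most clearly during a network partition. Below is a deliberately tiny sketch (all names and values invented for illustration) of two replicas that can no longer sync: an availability-first (AP) read answers with possibly stale data, while a consistency-first (CP) read refuses to answer at all.

```python
# Toy sketch of the C.A.P. trade-off: a write lands on the primary
# replica, but a partition prevents it from reaching the secondary.
class Replica:
    def __init__(self):
        self.value = "v1"

primary, secondary = Replica(), Replica()
primary.value = "v2"   # write applied on primary only
partitioned = True     # replication link is down

def read_ap(replica):
    # Availability-first (AP): always answer, even if stale.
    return replica.value

def read_cp(replica, partitioned):
    # Consistency-first (CP): refuse to answer rather than risk
    # returning stale data while the partition lasts.
    if partitioned:
        raise RuntimeError("partitioned: read rejected")
    return replica.value

print(read_ap(secondary))  # 'v1' -- stale but available
try:
    read_cp(secondary, partitioned)
except RuntimeError as err:
    print(err)             # partitioned: read rejected
```

Neither behaviour is "wrong"; which one you want depends on the use-case, which is exactly why mixed solutions per use-case make sense.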
NoSQL and C.A.P.
 C.A.P. means you have to make an informed choice (and sacrifice)
 No single perfect solution
 Opt for mixed solutions per use-case
 Remember we're talking about read/write volume, not just size
Confused? Let's take a breath and focus
OK, so where do I go from here?
 Identify your needs and limitations
 Choose a few candidates
 Research & Prototype
 Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB
(omitted due to time constraints).
Real-Time Big Data Now!
Real-Time big data processing
 Processing big data in real-time is about data volumes rather than just size.
For example, given a rate of 100K ops/sec, how do I do the following in
real-time?
 Find anomalies in a data stream (spam)
 Group check-ins by geo
 Identify trending pages / topics
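The "trending topics" task above is usually tackled with some form of sliding window over the event stream. Here is a minimal sketch in plain Python (topic names and window size invented for illustration): keep only the most recent N events and rank topics within that window, so old events age out automatically.

```python
from collections import Counter, deque

WINDOW = 5
window = deque(maxlen=WINDOW)  # oldest events fall off automatically

def observe(topic):
    # Called once per incoming event in the stream.
    window.append(topic)

def trending(k=2):
    # Rank topics by frequency within the current window.
    return [topic for topic, _count in Counter(window).most_common(k)]

for topic in ["cats", "cats", "news", "cats", "news", "news", "news"]:
    observe(topic)

# Window now holds the last 5 events: news, cats, news, news, news
print(trending())  # ['news', 'cats']
```

A streaming framework distributes exactly this kind of per-event update across many machines at rates like 100K ops/sec.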
Hadoop isn't for real-time processing
 When it comes to data processing and analysis, Hadoop's M/R framework
is wonderful for batch (offline) processing.
 However, processing, analysing and querying Hadoop data in real-time is
quite difficult.
Apache Storm and Apache Spark
 Apache Storm and Apache Spark are two frameworks for large-scale,
distributed data processing in real-time.
 One could say that Storm and Spark are to real-time data processing
what Hadoop M/R is to batch data processing.
Apache Storm - Highlights
 Runs on the JVM (Clojure / Java mix)
 Fully distributed and fault-tolerant
 Highly-scalable and extremely fast
 Interoperability with popular languages (Scala, Python etc.)
 Mature and production ready
 Hadoop interoperability via Storm-YARN
 Stateless / Non-Persistent (Data brought to processors)
Apache Spark - Highlights
 Fully distributed and extremely fast
 Write applications in Java, Scala and Python
 Perfect for both batch and real-time
 Combine Hadoop SQL (Shark), Machine Learning and Data streaming
 Native Hadoop interoperability
 HDFS, HBase, Cassandra, Flume as data sources
 Stateful / Persistent (Processors brought to data)
Storm & Spark - Use Cases
 Continuous/Cyclic Computation
 Real-time analytics
 Machine Learning (e.g. recommendations, personalisation)
 Graph Processing (e.g. social networks) - Spark only
 Data Warehouse ETL (Extract, Transform, Load)
Recap
Term - Purpose
 Big Data - Collective term for data-processing solutions at scale
 Hadoop - Scalable file system and batch-processing platform
 Batch Processing - Sifting and analysing data offline / in the background
 M/R - Parallel, batch data-processing model
 3S Challenge - Size, Speed and Scale of DBs
 C.A.P. - Consistency, Availability, Partition Tolerance
 NoSQL - Family of DBMS that grew out of the 3S Challenge
 NewSQL - Family of DBMS that provide ACID at scale
Questions?!
Feel free to drop me a line:
Email: ahmed.sayed.shouman@gmail.com
