Introduction to
Big Data
AHMED SHOUMAN
Our agenda
 Demystify the term "Big Data"
 Find out what Hadoop is
 Explore the realms of batch and real-time big data processing
 Explore challenges of size, speed and scale in databases
 Skim the surface of big-data technologies
 Provide ways into the big-data world
Big Data
Demystified
What is big data?
 Big data is a collective term for a set of technologies designed
for storage, querying and analysis of extremely large data sets,
sources and volumes.
 Big data technologies come in where traditional off-the-shelf
databases, data warehousing systems and analysis tools fall
short.
How did we end up with so much data?
 Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine
 Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud
 An Important Side Note
Big Data technologies are based on the concept of clustering - Many computers
working in sync to process chunks of our data.
Not just size
 Big data isn't just about data size, but also about data volume,
diversity and inter-connectedness.
Big data is
 Any attribute of our data that challenges either technological capabilities
or business needs, like:
 Scaling, moving, storage and retrieval of ever-growing generated data
 Processing many small data points in real-time
 Analysing diverse semi-structured data from multiple sources
 Querying multiple, diverse data sources in real-time
Breathe... Let's recap
 Lots of data due to technological capabilities and social paradigms
 Not just size! Diversity, volume and inter-connectedness also count
 Scale, speed, processing, querying and analysis
 Challenges technological capabilities or business needs
Hadoop The Elephant in the Room
Everyone talks about Hadoop
 Hadoop is a powerful platform for batch analysis of large
volumes of both structured and unstructured data.
From: Conquering Hadoop with Haskell
Hadoop explained
 Hadoop is a horizontally scalable, fault-tolerant, open-source file system
and batch-analysis platform capable of processing large amounts of data.
 HDFS - Hadoop Distributed File System
 M/R - Hadoop Map-Reduce platform
Hadoop explained
 HDFS is an ever-growing file system. We can store lots and
lots of data on it for later use.
 HDFS is used as the underlying platform for other
technologies like Hadoop M/R, Apache Mahout or HBase.
Hadoop explained
 Imagine we want to look at 30 days' worth of access logs to identify site
usage patterns at a volume of 30M log entries per day.
 Hadoop M/R is a platform that allows us to query HDFS data in parallel for
the purpose of batch (offline) data processing and analysis.
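To make the M/R model concrete, here is a toy sketch in plain Python of the three phases (map, shuffle, reduce) applied to the access-log scenario above. The log format and URLs are made up for illustration; a real Hadoop job would run these phases in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical access-log entries: "<ip> <url>" per record.
logs = [
    "10.0.0.1 /home",
    "10.0.0.2 /pricing",
    "10.0.0.1 /home",
    "10.0.0.3 /docs",
]

# Map phase: emit (key, 1) for each record -- here, one hit per URL.
def map_fn(record):
    _ip, url = record.split()
    yield (url, 1)

# Shuffle phase: group all intermediate values by key.
groups = defaultdict(list)
for record in logs:
    for key, value in map_fn(record):
        groups[key].append(value)

# Reduce phase: aggregate each key's values independently
# (on a real cluster, in parallel across machines).
def reduce_fn(key, values):
    return (key, sum(values))

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'/home': 2, '/pricing': 1, '/docs': 1}
```

The key property is that both the map and the reduce steps work on independent chunks of data, which is what lets Hadoop spread the work over many machines.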
Why is Hadoop so important?
 Scalable and fault-tolerant
 Handles massive amounts of data
 Truly parallel processing
 Data can be semi-structured or unstructured (schemaless)
 Serves as basis for other technologies (HBase, Mahout, Impala, Shark)
Hadoop - Words of caution
 Complex
 Not for real-time
 Choose a distribution (Cloudera, Hortonworks, MapR) for better interoperability
 Requires trained DevOps for day-to-day operations
Breathe...
 We demystified the term Big Data and glimpsed at Hadoop. Now What?
 How do I really get into the Big Data world?
The world of big data
 Batch & Data Science
 DBs
 Real-Time
Batch Processing
Hadoop M/R
Batch processing of large data sets
 We collect data for the purpose of providing end-users with a better
experience in our business domain. This means we have to constantly
query our data and divine new insights and relevant information.
 The problem is that doing this at very large scales is painful and slow.
How do we do this on Hadoop data?
Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Batch processing of large data sets
 Hadoop gives us the basic tools for large data processing in
the form of M/R.
However, Hadoop M/R is pretty annoying to work with
directly, as it lacks many tools relevant to the job (statistical
analysis, machine learning, etc.)
Source: http://xiaochongzhang.me/blog/?p=338
Hadoop querying and data science
tools
 Tool - Purpose
 Hive - Write SQL-like M/R queries on top of Hadoop
 Shark - Hive-compatible, distributed SQL query engine for Hadoop
 Pig - Write scripted M/R queries on top of Hadoop
 Impala - Real-time, SQL-like queries over Hadoop data
 Mahout - Scalable machine learning on top of Hadoop M/R
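To get a feel for what "SQL-like" buys you: a HiveQL aggregation such as `SELECT url, COUNT(*) FROM logs GROUP BY url` (table and column names are made up here) compiles down to an M/R job. The same computation, sketched in plain Python:

```python
from collections import Counter

# Hypothetical "logs" rows: (user, url) tuples.
logs = [("alice", "/home"), ("bob", "/docs"), ("alice", "/docs")]

# Roughly what the GROUP BY / COUNT(*) query computes:
# group rows by url and count each group.
hits_per_url = Counter(url for _user, url in logs)
print(hits_per_url.most_common())  # [('/docs', 2), ('/home', 1)]
```

Hive's value is that you write the one-line query and it generates the map, shuffle and reduce plumbing for you.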
The gentle way in
 Hive or Shark are a great place to start due to their SQL-like nature
 Shark is faster than Hive - less frustration
 You need some Hadoop data to work with (consider Avro)
 Remember - it's SQL-like, not SQL
 Start small, locally and grow to production later
 Check out Apache Sqoop for moving processed Hadoop data to your DB
Databases In the big data world
Databases in the big data world
 The Problem: Traditional RDBMS were not designed for storing, indexing
and querying ever-growing volumes of data.
 The 3S Challenge:
 Size - How much data is written and read
 Speed - How fast can we write and read data
 Scale - How easily can our DB scale to accommodate more data
The 3S Challenge
 There's no single, simple solution to the 3S challenge. Instead,
solutions focus on making an informed sacrifice in one area in
order to gain in another area.
NoSQL and C.A.P.
 NoSQL is a term referring to a family of DBMS that attempt to resolve the
3S challenge by sacrificing one of three areas:
 Consistency - All clients have the same view of data
 Availability - Each client can always read and write
 Partition Tolerance - System works despite physical network failures
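The trade-off shows up most clearly during a network partition. Below is a deliberately tiny sketch (all names and values invented for illustration) of two replicas that can no longer sync: an availability-first (AP) read answers with possibly stale data, while a consistency-first (CP) read refuses to answer at all.

```python
# Toy sketch of the C.A.P. trade-off: a write lands on the primary
# replica, but a partition prevents it from reaching the secondary.
class Replica:
    def __init__(self):
        self.value = "v1"

primary, secondary = Replica(), Replica()
primary.value = "v2"   # write applied on primary only
partitioned = True     # replication link is down

def read_ap(replica):
    # Availability-first (AP): always answer, even if stale.
    return replica.value

def read_cp(replica, partitioned):
    # Consistency-first (CP): refuse to answer rather than risk
    # returning stale data while the partition lasts.
    if partitioned:
        raise RuntimeError("partitioned: read rejected")
    return replica.value

print(read_ap(secondary))  # 'v1' -- stale but available
try:
    read_cp(secondary, partitioned)
except RuntimeError as err:
    print(err)             # partitioned: read rejected
```

Neither behaviour is "wrong"; which one you want depends on the use-case, which is exactly why mixed solutions per use-case make sense.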
NoSQL and C.A.P.
 C.A.P. means you have to make an informed choice (and sacrifice)
 No single perfect solution
 Opt for mixed solutions per use-case
 Remember we're talking about read/write volume, not just size
Confused? Let's take a breath and focus
OK, so where do I go from here?
 Identify your needs and limitations
 Choose a few candidates
 Research & Prototype
 Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB
(omitted due to time constraints).
Real-Time Big Data Now!
Real-Time big data processing
 Processing big data in real-time is about data volumes rather than just size.
For example, given a rate of 100K ops/sec, how do I do the following in
real-time?
 Find anomalies in a data stream (spam)
 Group check-ins by geo
 Identify trending pages / topics
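The "trending topics" task above is usually tackled with some form of sliding window over the event stream. Here is a minimal sketch in plain Python (topic names and window size invented for illustration): keep only the most recent N events and rank topics within that window, so old events age out automatically.

```python
from collections import Counter, deque

WINDOW = 5
window = deque(maxlen=WINDOW)  # oldest events fall off automatically

def observe(topic):
    # Called once per incoming event in the stream.
    window.append(topic)

def trending(k=2):
    # Rank topics by frequency within the current window.
    return [topic for topic, _count in Counter(window).most_common(k)]

for topic in ["cats", "cats", "news", "cats", "news", "news", "news"]:
    observe(topic)

# Window now holds the last 5 events: news, cats, news, news, news
print(trending())  # ['news', 'cats']
```

A streaming framework distributes exactly this kind of per-event update across many machines at rates like 100K ops/sec.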
Hadoop isn't for real-time processing
 When it comes to data processing and analysis, Hadoop's M/R framework
is wonderful for batch (offline) processing.
 However, processing, analysing and querying Hadoop data in real-time is
quite difficult.
Apache Storm and Apache Spark
 Apache Storm and Apache Spark are two frameworks for large-scale,
distributed data processing in real-time.
 One could say that Storm and Spark are to real-time data processing
what Hadoop M/R is to batch data processing.
Apache Storm - Highlights
 Runs on the JVM (Clojure / Java mix)
 Fully distributed and fault-tolerant
 Highly-scalable and extremely fast
 Interoperability with popular languages (Scala, Python etc.)
 Mature and production ready
 Hadoop interoperability via Storm-YARN
 Stateless / Non-Persistent (Data brought to processors)
Apache Spark - Highlights
 Fully distributed and extremely fast
 Write applications in Java, Scala and Python
 Perfect for both batch and real-time
 Combine Hadoop SQL (Shark), Machine Learning and Data streaming
 Native Hadoop interoperability
 HDFS, HBase, Cassandra, Flume as data sources
 Stateful / Persistent (Processors brought to data)
Storm & Spark - Use Cases
 Continuous/Cyclic Computation
 Real-time analytics
 Machine Learning (e.g. recommendations, personalisation)
 Graph Processing (e.g. social networks) - Spark only
 Data Warehouse ETL (Extract, Transform, Load)
Recap
Term - Purpose
 Big Data - Collective term for data-processing solutions at scale
 Hadoop - Scalable file system and batch-processing platform
 Batch Processing - Sifting and analysing data offline / in the background
 M/R - Parallel, batch data-processing model
 3S Challenge - Size, Speed and Scale of DBs
 C.A.P. - Consistency, Availability, Partition Tolerance
 NoSQL - Family of DBMS that grew out of the 3S Challenge
 NewSQL - Family of DBMS that provide ACID at scale
Questions?!
Feel free to drop me a line:
Email: ahmed.sayed.shouman@gmail.com
