The document discusses Syoncloud, a consulting company specializing in big data analytics and integration. It provides examples of common data sources that can be processed using big data solutions, such as documents, databases, e-mails, sensor data, and social media. It then discusses when a NoSQL or big data solution is needed over a relational database, and outlines some of the key components of the Apache Hadoop batch processing infrastructure, including Apache Avro for data serialization, Apache Pig for writing Map/Reduce scripts, and Apache Hive for SQL-like queries.
Webinar: Selecting the Right SQL-on-Hadoop Solution (MapR Technologies)
In the crowded SQL-on-Hadoop market, choosing the right solution for your business can be difficult. In this webinar, learn firsthand from Rick van der Lans, independent analyst and managing director of R20/Consultancy, how to sort through this market complexity and what tough questions to ask when evaluating prospective SQL-on-Hadoop solutions.
Introduction to Apache Hadoop. Covers Hadoop from v1.0 with HDFS and MapReduce through v2.0, including Impala, YARN, Tez and the entire arsenal of projects for Apache Hadoop.
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop (Hortonworks)
Real-time monitoring requires a highly scalable infrastructure comprising a message bus, a database, distributed event processing and a scalable analytics engine. By bringing together the leading open source projects Apache Kafka, Apache HBase, Apache Storm and Apache Hive, the Hortonworks Data Platform offers a comprehensive real-time analysis platform. In this session, we will provide an in-depth overview of all the key technology components and demonstrate a working solution for monitoring a fleet of trucks.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=0278dc8aa49a9991e1ce436c71f53d30
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Ingesting Data at Blazing Speed Using Apache ORC (DataWorks Summit)
Big SQL is a SQL engine for Hadoop that excels at performance and scalability at high concurrency. Big SQL complements and integrates with Apache Hive for both data and metadata. An architecture that separates compute from storage allows Big SQL to support multiple open data formats natively. Until recently, Parquet provided a significant performance advantage over other data formats for SQL on Hadoop. The landscape changed when ORC became a top-level Apache project independent from Hive. Gone were the days of reading ORC files using slow, single-row-at-a-time Hive SerDes. The new vectorized APIs in the Apache ORC libraries make it possible to ingest ORC data at blazing speed. This talk is about the journey leading to ORC taking the crown of best-performing data format for Big SQL away from Parquet. We'll have a look under the hood at the architecture of Big SQL ORC readers, and how to tune them. We'll share lessons learned in walking the fine line between maximizing performance at scale and avoiding dreaded Java OOMs. You'll learn the techniques that SQL engines use for fast data ingestion, so that you can leverage the full potential of Apache ORC in any application.
Speaker:
Gustavo Arocena, Big Data Architect, IBM
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud (DataWorks Summit)
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real-time with change data capture (CDC) technology
• Stream SAP data into Kafka with Attunity Replicate for SAP
Speakers:
John Hol, Regional Director, Attunity
Mike Hollobon, Director Business Development, IBT
The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.
This is the presentation from the "Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS" webinar on May 28, 2014. Rohit Bahkshi, a senior product manager at Hortonworks, and Vinod Vavilapalli, PMC for Apache Hadoop, discuss an overview of YARN in HDFS and new features in HDP 2.1. Those new features include: HDFS extended ACLs, HTTPS wire encryption, HDFS DataNode caching, resource manager high availability, application timeline server, and capacity scheduler pre-emption.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
This is an in-depth look at the future of data warehouses and how SQL-on-Hadoop technologies play a pivotal role in those settings.
Matt Aslett, Research Director for 451 Research, is joined by Apache Drill architect Jacques Nadeau to share what lies ahead for enterprise data warehouse architects and BI users in 2015 and beyond.
SQL on Hadoop
Looking for the right tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how do you select the right one?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop (Hortonworks)
Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, including:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: Defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop (Hortonworks)
How can you simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation to find out.
Red Hat in Financial Services - Presentation at Hortonworks Booth - Strata 2014 (Hortonworks)
This presentation gives an overview of how Hortonworks and Red Hat have collaborated to provide the financial services industry with Big Data solutions.
Introduction to the Hortonworks YARN Ready Program (Hortonworks)
The recently launched YARN Ready Program will accelerate multi-workload Hadoop in the Enterprise. The program enables developers to integrate new and existing applications with YARN-based Hadoop. We will cover:
--the program and its benefits
--why it is important to customers
--tools and guides to help you get started
--technical resources to support you
--marketing recognition you can leverage
Stinger.Next by Alan Gates of Hortonworks (Data Con LA)
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1,600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the speed, scale, and SQL compliance of Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever-growing data in new ways, well-known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver Speed, Scale and better SQL.
Hadoop and Internet of Things presentation from the Sinergija 2014 conference, held in Belgrade in October 2014. How rising data resources change business, and how Big Data technologies combined with Internet of Things devices can help to improve business and everyday life. Hadoop is already the most significant technology for working with Big Data. Microsoft is playing a very important role in this field, with the Stinger initiative. The main goal is to bring enterprise SQL to Hadoop scale.
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulatory-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating analytics workflows and tools on converged infrastructure, with shared data, and building "As A Service" architectures oriented towards self-service data exploration and analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
Data Analytics Meetup: Introduction to Azure Data Lake Storage (CCG)
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen 1 with Microsoft Data Platform Specialist, Audrey Hammonds. In this video she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
How to Radically Simplify Your Business Data Management (Clusterpoint)
Relational databases were designed around a tabular data storage model. This requires complex software: schemas, encoded data, inflexible relations, sophisticated indexes. The complexity of your IT systems increases many-fold over your database's lifetime, and so do your costs. Yet, we have a solution for this.
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Azure Data Platform Services
HDInsight Clusters in Azure
Data Storage: Apache Hive, Apache HBase, Azure Data Catalog
Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
Healthcare / Life Sciences Use Cases
Azure Cafe Marketplace with Hortonworks, March 31 2016 (Joan Novino)
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how Hortonworks Data Platform (HDP), architected, developed and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
The Transformation of your Data in modern IT (Presented by DellEMC) (Cloudera, Inc.)
Organizations have a wealth of data contained within their existing infrastructures. At DellEMC we’re helping customers remove the barriers of legacy datastores and transforming the customer experience in the modern datacentre. Learn how to unshackle the valuable data inside your existing data warehouse, and leverage new techniques, applications and technology to enhance the financial impact of all your data sources.
2. Ladislav Urban, CEO of Syoncloud.
Syoncloud is a consulting company specializing in Big Data analytics and integration of existing systems.
WWW.SYONCLOUD.COM E-MAIL : INFO@SYONCLOUD.COM MOBILE : 077 9664 6474
3. CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
Documents
Existing relational databases (CRM, ERP, Accounting, Billing)
E-mails and attachments
Imaging data (graphs, technical plans)
Sensor or device data
Internet search indexing
Log files
Social media
4. CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
Telephone conversations
Videos
Pictures
Clickstreams (clicks from users on web pages)
5. SCALE OF THE DATA
6. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
If relational databases do not scale to your traffic needs
If the normalized schema of your relational database has become too complex
If your business applications generate lots of supporting and temporary data
If the database schema is already denormalized in order to improve response times
If joins in relational databases slow the system down to a crawl
7. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
When we try to map complex hierarchical documents to database tables
When documents from different sources require a flexible schema
When more data beats clever algorithms
When flexibility is required for analytics
When we need queries for values at a specific time in history
When we need to utilize outputs from many existing systems
8. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms
9. WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
The SQL language: well known, standardized and based on strong mathematical theories
Database schemas that do not need to be modified during production
A good fit when massive scalability is not required
Mature security features: role-based security, encrypted communications, row and field access control
Full support for ACID transactions (atomicity, consistency, isolation, durability)
10. WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
Support for backup and rollback in case of data loss or corruption
Relational databases have development, tuning and monitoring tools with good GUIs
11. Batch vs Real-time Processing
Batch processing is used when real-time processing is not required, not possible, or too expensive. Typical uses:
Conversion of unstructured data such as text files and log files into more structured records
Transformation during ETL
Ad-hoc analysis of data
Data analytics applications and reporting
13. BATCH PROCESSING INFRASTRUCTURE
Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.
It is possible to develop batch processing applications in Java using only Hadoop, but we should mention other important systems and how they fit into the Hadoop infrastructure.
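To make concrete what "using only Hadoop" means, here is a minimal sketch of a plain Java Map/Reduce job, modeled on the classic word-count example from the Hadoop documentation; the class name and the input/output paths are illustrative only.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this trivial job takes dozens of lines of Java, which is why the higher-level tools below exist.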
14. APACHE AVRO
In order to process data we need information about data types and data schemas.
This information is used for serialization and deserialization in RPC communications, as well as for reading from and writing to files.
15. APACHE AVRO
An RPC and serialization system that supports rich data structures
It uses JSON to define data types and protocols
It serializes data in a compact binary format
Avro supports schema evolution
Avro will handle missing/extra/modified fields
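As an illustration, here is a minimal sketch of an Avro schema defined in JSON; the record, namespace and field names are hypothetical. The optional "host" field with a default value shows how schema evolution lets a reader cope with records written before the field existed.

{
  "type": "record",
  "name": "LogEvent",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "host", "type": ["null", "string"], "default": null}
  ]
}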
16. SCRIPT LANGUAGE FOR MAP/REDUCE
We need a quick and simple way to create Map/Reduce transformations, analyses and applications.
We need a scripting language that can be used in scripts as well as interactively on the command line.
18. APACHE PIG
A high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform
Pig simplifies the use of Hadoop by allowing SQL-like queries to run on a distributed dataset.
19. APACHE PIG
An example of filtering a log file for warning messages only, which will run in parallel on a large cluster.
The script below is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.
messages = LOAD '/var/log/messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;
20. APACHE PIG
Relational operators that can be used in Pig:
FILTER - Select a set of tuples from a relation based on a condition.
FOREACH - Iterate over the tuples of a relation, generating a data transformation.
GROUP - Group the data in one or more relations.
JOIN - Join two or more relations (inner or outer join).
LOAD - Load data from the file system.
ORDER - Sort a relation based on one or more fields.
SPLIT - Partition a relation into two or more relations.
STORE - Store data in the file system.
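A short sketch combining several of these operators; the input path and the tab-separated field layout are hypothetical:

-- count WARN messages per day, assuming tab-separated (day, level, msg) records
logs = LOAD '/var/log/messages' USING PigStorage('\t') AS (day:chararray, level:chararray, msg:chararray);
warns = FILTER logs BY level == 'WARN';
by_day = GROUP warns BY day;
counts = FOREACH by_day GENERATE group AS day, COUNT(warns) AS warn_count;
sorted = ORDER counts BY warn_count DESC;
STORE sorted INTO '/output/warn_counts';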
21. What if we want to use SQL to create Map/Reduce jobs?
Apache Hive is a data warehousing infrastructure based on Hadoop.
It provides a query language called HiveQL, which is based on SQL.
22. APACHE HIVE
Hive functions: data summarization, query and analysis.
It uses a system catalog called the Hive Metastore.
Hive is not designed for OLTP or real-time queries.
It is best used for batch jobs over large sets of append-only data.
24. The HiveQL language supports the ability to:
Filter rows from a table using a where clause.
Select certain columns from the table using a select clause.
Do equi-joins between two tables.
Evaluate aggregations on multiple "group by" columns for the data stored in a table.
Store the results of a query into another table.
Download the contents of a table to a local (NFS) directory.
25. The HiveQL language supports the ability to:
Store the results of a query in an HDFS directory.
Manage tables and partitions (create, drop and alter).
Plug in custom scripts in the language of choice for custom map/reduce jobs.
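A brief sketch of these capabilities in HiveQL; the table and column names are hypothetical:

-- create a partitioned table
CREATE TABLE orders (order_id INT, customer STRING, total DOUBLE)
PARTITIONED BY (order_date STRING);

-- filter, aggregate with GROUP BY, and store the result into another table
-- (customer_totals is assumed to already exist)
INSERT OVERWRITE TABLE customer_totals
SELECT customer, SUM(total)
FROM orders
WHERE order_date = '2013-01-01'
GROUP BY customer;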
26. APACHE OOZIE
Map/Reduce jobs, Pig scripts and Hive queries should be simple and single-purposed.
How can we create complex ETL or data analysis in Hadoop?
We chain scripts so that the output of one script is the input for another.
Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.
27. APACHE OOZIE
Oozie is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce jobs, Pig jobs and others.
An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph).
This means that the second action cannot run until the first one is completed.
Oozie workflow definitions are written in hPDL (an XML process definition language similar to JBoss jBPM jPDL).
28. APACHE OOZIE
Workflow actions start jobs in the Hadoop cluster. Upon action completion, Hadoop calls back to Oozie to notify it that the action has completed; at this point Oozie proceeds to the next action in the workflow.
Oozie workflows contain control flow nodes (start, end, fail, decision, fork and join) and action nodes (the actual jobs).
Workflows can be parameterized (using variables like ${inputDir} within the workflow definition)
29. Example of OOZIE workflow definition
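The XML listing on the original slide did not survive the transcript. As a substitute, here is a minimal sketch of what an hPDL workflow definition looks like, with a single Pig action and hypothetical workflow, action and script names:

<workflow-app name="log-etl" xmlns="uri:oozie:workflow:0.1">
  <start to="filter-warns"/>
  <action name="filter-warns">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>filter_warns.pig</script>
      <param>inputDir=${inputDir}</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>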
33. APACHE SQOOP
Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.
It can be used to populate tables in Hive and HBase.
Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
Sqoop uses a connector-based architecture which supports plugins that provide connectivity to external systems.
34. APACHE SQOOP
Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2, as well as a generic JDBC connector.
The transferred dataset is sliced into partitions and a map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
Sqoop uses the database metadata to infer data types
35. Apache Sqoop – Import to HDFS
36. APACHE SQOOP
A Sqoop example that imports data from the MySQL ORDERS table into a Hive table running on Hadoop:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --hive-import
Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition.
37. Apache Sqoop – Export to Database
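The original slide shows a diagram of the export flow. For completeness, a sketch of the corresponding command, mirroring the import example above; the export directory is hypothetical:

sqoop export --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --export-dir /user/hive/warehouse/orders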
38. APACHE FLUME
A distributed system to reliably collect, aggregate and move large amounts of log data from many different sources to a centralized data store.
40. APACHE FLUME
A Flume Source consumes events delivered to it by an external source like a web server.
When a Flume Source receives an event, it stores it into one or more Channels.
The Channel is a passive store that keeps the event until it is consumed by a Flume Sink.
The Sink removes the event from the Channel and puts it into an external repository like HDFS.
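Flume agents are wired together in a properties file. A minimal sketch of a single agent with one source, channel and sink; the agent and component names are hypothetical:

# name the components of this agent
agent.sources = weblog
agent.channels = mem
agent.sinks = hdfs-out

# source: receive Avro events from an upstream tier
agent.sources.weblog.type = avro
agent.sources.weblog.bind = 0.0.0.0
agent.sources.weblog.port = 41414
agent.sources.weblog.channels = mem

# channel: passive store between source and sink
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# sink: deliver events to HDFS
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent.sinks.hdfs-out.channel = mem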
41. APACHE FLUME FEATURES
It allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.
It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
Flume uses a transactional approach to guarantee reliable delivery of events.
Events are staged in the channel, which manages recovery from failure.
Flume supports log stream types such as Avro, Syslog and Netcat.
42. DISTCP - DISTRIBUTED COPY
DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying.
It uses Map/Reduce for its distribution, error handling, recovery and reporting.
It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
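Typical usage copies a directory tree from one cluster to another, for example (the NameNode addresses and paths are hypothetical):

hadoop distcp hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir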