Big Data Applications
Juan Pablo Paz Grau, PhD, PMP.
Systems Engineer
Specialist in Information Systems Management
PhD in Software Engineering
Certified in ITIL Foundation, PMP
Currently, I work at LG CNS Colombia
LG CNS Colombia is the IT partner of the SIRCI operation
The SIRCI Operation = Transmilenio Operation
Transmilenio is the world-renowned reference for BRT systems
The biggest public traffic system operation in Colombia
Presentation Agenda
1. What is Big Data?
2. Large Dataset Management Techniques
3. Hadoop Cluster Architecture
4. Closing the Loop: Real Time Cluster Architecture
5. The Development Process for Big Data Systems
6. Showcase of Big Data Tools for Public Traffic Systems
What is Big Data?
The DIKW Triangle
What is Big Data?
Information displayed
to final users
Data generated to
provide information
displayed to final
users
…
What is Big Data?
• Organizations produce lots of
data while they operate their
Information Systems
• Log files
• Access log files
• Debug log files
• Temporal, transient data
• Transactional data
• Usually, this data is stored
temporarily only for debugging
or incident analysis purposes
• With the increasing capacity to
store data, this data is now being
reviewed and considered a
valuable source of information
Large Dataset Management Techniques
Very small intro to Hadoop
Cheap, reliable storage of
big datasets in commodity
hardware
A framework to parallelize
big data processing and
analysis
What is Hadoop?
Large Dataset
Large Dataset Management Techniques
Very small intro to Hadoop: Hadoop Distributed File System (HDFS)
File is split into
data blocks
File metadata and block
locations are stored in the
name node
Data blocks are physically
stored in data nodes
Block B:
• If Data Node 0 fails, there is another
copy in the same rack at Data Node 1
• If the rack fails, there is still another
copy in another rack at Data Node 2
Rack 1 Rack 2
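The Block B example can be sketched in a few lines of Python. This is only an illustration of the layout described on the slide (two copies in one rack, one copy in another rack); it is not the HDFS implementation, and HDFS's actual placement policy may order the replicas differently. The names `DataNode`, `place_replicas` and `surviving_copies`, as well as Data Node 3, are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    name: str
    rack: str


def place_replicas(nodes, first):
    """Choose three nodes for one block, following the slide's layout:
    the first node, another node in the same rack, and a node in a
    different rack. (Hypothetical helper, not an HDFS API.)"""
    same_rack = [n for n in nodes if n.rack == first.rack and n != first]
    other_rack = [n for n in nodes if n.rack != first.rack]
    if not same_rack or not other_rack:
        raise ValueError("need two nodes in one rack and one node in another")
    return [first, same_rack[0], other_rack[0]]


def surviving_copies(replicas, failed_nodes=(), failed_racks=()):
    """Return the replicas that are still reachable after a failure."""
    return [
        n for n in replicas
        if n.name not in failed_nodes and n.rack not in failed_racks
    ]


nodes = [
    DataNode("Data Node 0", "Rack 1"),
    DataNode("Data Node 1", "Rack 1"),
    DataNode("Data Node 2", "Rack 2"),
    DataNode("Data Node 3", "Rack 2"),  # extra node, assumed for the sketch
]
block_b = place_replicas(nodes, nodes[0])

# A single node failure leaves two copies; a whole-rack failure leaves one.
after_node_failure = surviving_copies(block_b, failed_nodes={"Data Node 0"})
after_rack_failure = surviving_copies(block_b, failed_racks={"Rack 1"})
```

In both failure cases at least one copy of Block B remains readable, which is the point of spreading the copies across racks.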
Large Dataset Management Techniques
Very small intro to Hadoop: MapReduce
Map: Select data that
matches a given criterion
(Status = Trip). The map
function returns a set of
{Key,Value} pairs
Shuffle: Collect and
sort the mapped pairs
Reduce: Apply a
reduce function (Sum
distance) for each key
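The three phases can be imitated in plain Python on a single machine. This is a minimal sketch of the idea, not Hadoop code: the record fields (`bus`, `status`, `distance_km`) and the sample values are assumptions invented for the illustration, and a real job would run the map and reduce functions in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical input records; field names and values are made up.
records = [
    {"bus": "B01", "status": "Trip", "distance_km": 12.5},
    {"bus": "B02", "status": "Idle", "distance_km": 0.0},
    {"bus": "B01", "status": "Trip", "distance_km": 7.5},
    {"bus": "B02", "status": "Trip", "distance_km": 4.0},
]


def map_phase(records):
    """Map: keep records with Status = Trip and emit (key, value) pairs."""
    for record in records:
        if record["status"] == "Trip":
            yield record["bus"], record["distance_km"]


def shuffle_phase(pairs):
    """Shuffle: collect the values of each key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: apply the reduce function (sum of distance) to each key."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)  # {'B01': 20.0, 'B02': 4.0}
```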
Large Dataset Management Techniques
Very small intro to Hadoop: The Hadoop ecosystem
• Currently, there is a plethora of tools to work
with Big Data on top of Hadoop.
• The selection of tools and frameworks will vary
depending on the implementation of the cluster.
Hadoop Cluster Architecture
The Lambda Architecture
Application
Data Access
Batch | Speed
Data
• Data layer: A data model and a set of data stored
following the data model. The data model should
be designed for the targeted subsystem.
• Batch layer: The computation layer that
processes data to turn facts into views for
querying the underlying stored data.
• Speed layer: A real-time computation layer that
compensates for the latency of the batch layer.
• Data Access layer: The engines, tools and
drivers that expose views to applications and
manage queries.
• Application layer: The front-end application or
applications that present information to users of
the Big Data system.
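How the layers cooperate can be sketched with a toy example in Python. This is an assumption-laden illustration, not a reference implementation: the facts are invented (timestamp, station) pairs, `build_view` stands in for both the batch and the speed computation, and `query` plays the role of the data access layer by merging the two views.

```python
from collections import Counter

# Data layer: facts already processed by the last batch run (invented data).
batch_facts = [(1, "A"), (2, "B"), (3, "A")]
# Facts that arrived after the last batch run and are only seen by the
# speed layer (invented data).
recent_facts = [(4, "A"), (5, "C")]


def build_view(facts):
    """Turn facts into a queryable view: number of events per station."""
    return Counter(station for _timestamp, station in facts)


batch_view = build_view(batch_facts)   # batch layer: complete but stale
speed_view = build_view(recent_facts)  # speed layer: recent data only


def query(station):
    """Data access layer: answer with the batch view plus the speed view."""
    return batch_view[station] + speed_view[station]


print(query("A"))  # 3
print(query("C"))  # 1
```

When the next batch run absorbs the recent facts, the speed view for that period can be discarded; the merged answer stays the same.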
Hadoop Cluster Architecture
Data Serialization
[Diagram: Source Systems, each with a Data Serialization step, feeding Raw Data into the Data Lake]
Data Access: Hive, Hadoop Data Warehouse
Hadoop Cluster Architecture
• Built on top of Hadoop
• Eases the tasks of managing data in Hadoop
• Manages files and schemas as tables
• Internal tables: Files managed by Hive
• External tables: Files located outside
of Hive but which can be analyzed with
Hive
• Provides a SQL-like language to query data
stored in files
• Translates HiveQL requests
into MapReduce jobs
HiveQL
Load Transform Dump
Data Access: Pig, Data Processing Language
Hadoop Cluster Architecture
• Built on top of Hadoop
• Eases the tasks of data processing and
analysis
• Capable of working with any type of data
source
• Provides a scripting language to process and
transform data
Pig Latin
Hadoop Cluster Architecture
Hive
• Works with structured data
• Can index data
• HiveQL, a SQL-like access language
• Turns the HiveQL input into MapReduce
jobs
Pig
• Works with structured/unstructured data
• Cannot index data
• Pig Latin, a scripting language
• Turns the Pig Latin input into MapReduce
jobs
Hive / Pig Comparison
Closing the Loop: Real Time Cluster Architecture
Why?
1. Hadoop is intended to store history, not changing data (write
once, read many times)
2. Batch processing of data usually takes a long time to produce
summarized output data
3. The capability to process Big Data in real time is also
desirable in the Lambda architecture
4. There is a need for a solution that covers the gap
between the data already in the Hadoop cluster and the new data being
generated
Data available
in Hadoop
New data
being created
New data
stored in
Hadoop
Data
Gap
Time
Closing the Loop: Real Time Cluster Architecture
Cassandra: Accessing the Cluster
CQL Driver
CQL
1. Access used to be through a Thrift client; now it is through a CQL client
2. CQL (Cassandra Query Language), a SQL-like language with a very small subset of SQL's features
3. The driver is not JDBC-like!
Cassandra: Data Model
1. Row oriented, instead of column oriented
2. Each row is identified by a key
3. Each key accesses a collection of columns
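The three points above can be pictured with plain Python dictionaries. This is only a mental model of the data layout, under the assumption that a table maps each row key to its own collection of columns; it does not use any Cassandra driver, and the helper names `upsert` and `get_row`, the keys and the column names are all invented for the sketch.

```python
# A "table" maps each row key to that row's own collection of columns.
table = {}


def upsert(table, row_key, **columns):
    """Create the row if needed and set (or overwrite) the given columns."""
    table.setdefault(row_key, {}).update(columns)


def get_row(table, row_key):
    """Return the columns stored under a row key (empty if unknown)."""
    return table.get(row_key, {})


upsert(table, "bus:B01", route="R1", status="Trip")
upsert(table, "bus:B01", status="Idle")  # overwrites a single column
upsert(table, "bus:B02", route="R7")     # rows need not share columns

print(get_row(table, "bus:B01"))  # {'route': 'R1', 'status': 'Idle'}
```

Every read and write starts from the row key, which is why choosing that key is the central design decision in this kind of model.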
The Development Process for Big Data Systems
Development Process: System Implementation
Hadoop Cluster Architecture
Master Node
• Resource Manager
• Name Node
• Hive Server
• Sqoop
• Apache Tomcat
• MySQL Server
Worker Nodes (4), each running:
• Data Node
• Node Manager
• Cassandra Node
Now, we have the cluster services up and running,
and data is flowing into our Big Data repository.
What's next?
Showcase of Big Data Tools for Public Traffic Systems

