BigData Analysis

Big Data Analysis Tools & Methods
Spring 2015
OCCC - Tehran

Personal Profile:
● Ehsan Derakhshan
● Founder & CEO at innfinision Cloud & BigData Solutions
● More than 15 years of experience (Telecom & Datacom)
● Ehsan.derakhshan@innfinision.net
● Innfinision.net

About innfinision:
● Providing Cloud, Virtualization and Data Center Solutions
● BigData Management - Analysis & Development Solutions
● Developing Software for Cloud Environments
● Providing Services to Telecom, Education, Banking & more...
● Supporting the OpenStack Foundation as the First Iranian Company
● Partner of: Docker, MongoDB, Red Hat

Agenda:
● What is Data & BigData?
● Important Questions
● Tools & Solutions
● Advantages - Why & Where

What is Data & BigData?

What is Data?
Data is a collection of facts, such as numbers, words, measurements, observations or
even just descriptions of things.
Data can exist in a variety of forms -- as numbers or text on pieces of paper, as bits and
bytes stored in electronic memory, or as facts stored in a person's mind. Strictly
speaking, data is the plural of datum, a single piece of information.

Big data refers to information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making. It can be described
by the following characteristics:
1- Volume
2- Velocity
3- Variety
4- Variability
5- Veracity
6- Complexity
7- And others

Important Questions

Important Question:
Can a database really deliver quantifiable business advantage?
To some, the database is a low-level infrastructure component of a much larger
application -- something that only developers, DBAs and operations staff need to
care or worry about.
However, in the digital economy, data is the raw currency. How an organization
stores, manages, analyzes and uses data has a direct impact on its success -- and its
costs. Its choice of database affects how quickly it can deliver new applications to
market, support business growth and improve customer experience.

Consider these examples:
- After trying for eight years to build a single view of its customer, one of the
world's leading insurance companies changed its database and delivered the project
in just three months
- A leading telecommunications provider adopted a new database technology and
was able to accelerate time to market by 4x, reduce engineering costs by 50%
and improve customer experience by 10x
- A Tier 1 investment bank rebuilt its globally distributed reference data platform
on a new database technology, enabling it to save an estimated $40M over five
years
- Singles can now find their ideal partner 95% faster after one of the world's leading
relationship providers switched its data and machine learning to a new platform

So why is database selection becoming so critical?
Because the requirements of modern applications and the demands of
sophisticated, data-savvy users are changing.
Data is being generated at much faster rates than ever before and can yield
insights never previously possible. The data no longer fits neatly into structured
rows and columns. Windows of market opportunity are getting smaller. Underlying
infrastructure is being commoditized, with powerful systems available for just
pennies per hour.
The database chosen by a project team can be the enabler -- or the blocker -- to
success. All of the assumptions that have dictated database selection over the
past 30 years are being revisited as a result of the factors discussed above.

Challenges for Database Selection:
- Risk tolerance for bugs and unmapped behaviors
- High availability (HA)
- Redundancy
- Access- and location-based requirements
- Security requirements
- Skill sets and tooling
- Architecture and infrastructure
- Growth expectations and the timeline therein (Scalable)
- Support? Community?
- Free Schema (Flexible Data Model)
- Scale Out
- Real-time
- Rich Queries
- Migration
- Drivers
- Faster
- Agile
- Backup/Restore
- Monitoring & …

Tools & Solutions

Innfinision BigData Solutions:
1- MongoDB :
MongoDB (from 'humongous') is a scalable, high-performance, open-source,
schema-free, document-oriented database.
It provides high performance, high availability, and easy scalability.
As a document database, its documents (objects) map nicely to programming language
data types. Embedded documents and arrays reduce the need for joins. A dynamic
schema makes polymorphism easier.
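
As a rough illustration of the document model described above, the sketch below uses plain Python dictionaries in place of MongoDB documents, so it runs without a server or driver. The `people` collection, its field names and values, and the `cities_of` helper are all hypothetical and exist only for this example.

```python
# Plain dicts standing in for MongoDB documents (no server or driver needed).
# All names and values below are made up for illustration.
people = [
    {
        "name": "Sara",
        "languages": ["fa", "en"],                      # array field
        "address": {"city": "Tehran", "zip": "12345"},  # embedded document
    },
    {
        # A differently shaped document in the same collection:
        # no "languages" field, and several addresses instead of one.
        "name": "Reza",
        "addresses": [
            {"city": "Shiraz"},
            {"city": "Tabriz"},
        ],
    },
]


def cities_of(person):
    """Return every city stored inside one person document, without a join."""
    cities = []
    if "address" in person:
        cities.append(person["address"]["city"])
    for addr in person.get("addresses", []):
        cities.append(addr["city"])
    return cities


for person in people:
    print(person["name"], cities_of(person))
```

Both documents live in the same list even though their fields differ, which is the kind of polymorphism a dynamic schema allows. The address data travels inside each document, so no second table has to be consulted.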
2- PyTables :
PyTables is a package for managing hierarchical datasets, designed to cope
efficiently with extremely large amounts of data.
It is built on top of the HDF5 library and the NumPy package. It features an
object-oriented interface that, combined with C extensions for the
performance-critical parts of the code (generated using Cython), makes it a fast
yet extremely easy-to-use tool for interactively saving and retrieving very large
amounts of data. One important feature of PyTables is that it optimizes memory and
disk resources so that data takes much less space (by a factor of 3 to 5, and more
if the data is compressible) than with other solutions, such as relational or
object-oriented databases.
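
The following is a minimal sketch of storing and querying a table with PyTables, assuming the `tables` package is installed. The `Reading` row layout, the group and table names, the synthetic values and the `write_and_query` helper are all invented for this example. Compression uses zlib at level 5 as an arbitrary choice.

```python
import os
import tempfile

import tables


class Reading(tables.IsDescription):
    """Row layout for a hypothetical sensor table."""
    sensor_id = tables.Int32Col()
    temperature = tables.Float64Col()


def write_and_query(path, n_rows=1000):
    """Write n_rows synthetic readings, then read back those above 30.0."""
    filters = tables.Filters(complevel=5, complib="zlib")
    with tables.open_file(path, mode="w", title="Demo file") as h5:
        # Groups give the file its hierarchy, much like directories.
        group = h5.create_group("/", "sensors", "Sensor data")
        table = h5.create_table(group, "readings", Reading,
                                "Synthetic readings", filters=filters)
        row = table.row
        for i in range(n_rows):
            row["sensor_id"] = i % 10
            row["temperature"] = 20.0 + (i % 20)
            row.append()
        table.flush()

    with tables.open_file(path, mode="r") as h5:
        table = h5.root.sensors.readings
        # The condition string is evaluated inside PyTables, not in a Python loop.
        hot = table.read_where("temperature > 30.0")
        return int(table.nrows), hot


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total, hot = write_and_query(os.path.join(tmp, "demo.h5"))
        print(total, len(hot))
```

The query returns a NumPy structured array, so the selected rows can go straight into further NumPy processing.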

3- Blosc :
Blosc is a high-performance compressor optimized for binary data. It has been
designed to transmit data to the processor cache faster than the traditional,
non-compressed, direct memory fetch approach via a memcpy() OS call. According to
the Blosc project, it is the first compressor (that its author is aware of) meant
not only to reduce the size of large datasets on-disk or in-memory, but also to
accelerate memory-bound computations.
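
To give a feel for what "optimized for binary data" can mean, the sketch below applies a byte shuffle before compression, a filter that Blosc also offers. This is an assumption-laden illustration, not Blosc itself: the slides do not mention shuffling, the standard-library `zlib` stands in for the Blosc codec, and the `shuffle` and `unshuffle` helpers are written here for the example. It shows only the effect on size and says nothing about speed.

```python
import struct
import zlib


def shuffle(data: bytes, typesize: int) -> bytes:
    """Group the k-th byte of every element together (byte shuffle)."""
    return b"".join(data[k::typesize] for k in range(typesize))


def unshuffle(data: bytes, typesize: int) -> bytes:
    """Invert shuffle(): interleave the byte planes back into elements."""
    n = len(data) // typesize
    out = bytearray(len(data))
    for k in range(typesize):
        out[k::typesize] = data[k * n:(k + 1) * n]
    return bytes(out)


# 10,000 consecutive 32-bit little-endian integers as sample binary data.
values = list(range(10_000))
raw = struct.pack(f"<{len(values)}i", *values)

plain = zlib.compress(raw)                 # compress the bytes as they are
shuffled = zlib.compress(shuffle(raw, 4))  # shuffle first, then compress

# The round trip must give back the original bytes.
assert unshuffle(zlib.decompress(shuffled), 4) == raw

print(len(raw), len(plain), len(shuffled))
```

After shuffling, the rarely changing high-order bytes of all the integers sit next to each other, so a general-purpose compressor finds long repeats more easily.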
4- Blaze :
Blaze is a high-level user interface for databases and array computing systems. It
consists of the following components:
- A symbolic expression system to describe and reason about analytic queries
- A set of interpreters from that query system to various databases /
computational engines
This architecture allows the same Blaze code to run against several computational
backends. Blaze interacts rapidly with the user and only communicates with the
database when necessary. Blaze is also able to analyze and optimize queries to
improve the interactive experience.
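
The two components listed above can be pictured with a toy sketch: one symbolic expression and two interpreters for it. This is not Blaze's API. The classes `Table` and `Filter`, the functions `to_sql` and `run_in_python`, and the sample data are all hypothetical, and the SQL string is built naively, without the escaping a real system would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Symbolic reference to a table; it holds no data."""
    name: str


@dataclass(frozen=True)
class Filter:
    """Symbolic 'rows of source where column > threshold'."""
    source: Table
    column: str
    threshold: float


def to_sql(expr: Filter) -> str:
    """Interpreter 1: translate the expression into a SQL string."""
    return (f"SELECT * FROM {expr.source.name} "
            f"WHERE {expr.column} > {expr.threshold}")


def run_in_python(expr: Filter, tables: dict) -> list:
    """Interpreter 2: evaluate the same expression on in-memory rows."""
    rows = tables[expr.source.name]
    return [row for row in rows if row[expr.column] > expr.threshold]


# One query description ...
query = Filter(Table("accounts"), "balance", 100)

# ... handed to two different backends.
print(to_sql(query))
data = {"accounts": [{"id": 1, "balance": 50}, {"id": 2, "balance": 250}]}
print(run_in_python(query, data))
```

Because the query is only a description until an interpreter receives it, nothing is sent to a backend before a result is actually needed.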

Advantages - Why - Where

MongoDB Advantages :
A relational database has a typical schema design that shows the number of tables
and the relationships between them, while MongoDB has no concept of
relationships.
Advantages of MongoDB over RDBMS
- Schemaless: MongoDB is a document database in which one collection holds
different documents. The number of fields, content and size of the
document can differ from one document to another.
- Structure of a single object is clear.
- No complex joins.
- Deep query-ability. MongoDB supports dynamic queries on documents using a
document-based query language that's nearly as powerful as SQL
- Tuning
- Ease of scale-out. MongoDB is easy to scale
- Conversion / mapping of application objects to database objects not needed
- Uses internal memory for storing the (windowed) working set, enabling faster
access to data
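
To make the document-based query language a little more concrete, here is a toy matcher written in plain Python. It mimics only two features of MongoDB queries, equality on a dotted field path and the `$gt` operator, and it is not how MongoDB evaluates queries. The helpers `get_path` and `matches` and the sample documents are hypothetical. With the PyMongo driver, a query document of this shape would instead be passed to a collection's `find` method.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc, query):
    """Check one document against a query document (equality and $gt only)."""
    for path, condition in query.items():
        value = get_path(doc, path)
        if isinstance(condition, dict) and "$gt" in condition:
            if value is None or not value > condition["$gt"]:
                return False
        elif value != condition:
            return False
    return True


docs = [
    {"name": "A", "age": 41, "address": {"city": "Tehran"}},
    {"name": "B", "age": 29, "address": {"city": "Tehran"}},
    {"name": "C", "age": 35},  # no address at all: still a valid document
]

# The query is itself a document: a field path mapped to a value or an operator.
query = {"address.city": "Tehran", "age": {"$gt": 30}}
print([d["name"] for d in docs if matches(d, query)])
```

The query reaches into the embedded `address` document directly, where a relational design would typically need a join.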

Why use MongoDB?
- Document-Oriented Storage: Data is stored in the form of JSON-style
documents
- Index on any attribute
- Replication & High Availability
- Auto-Sharding
- Rich Queries
- Fast In-Place Updates
- Professional Support
Where to use MongoDB?
- Big Data
- Content Management and Delivery
- Mobile and Social Infrastructure
- User Data Management
- Data Hub

Why use PyTables?
PyTables can be used in any scenario where you need to save and retrieve large
amounts of data and provide metadata (that is, data about the actual data) for it.
Whether you want to work with large datasets of (potentially multidimensional)
data, save and structure your NumPy datasets, or just provide a categorized
structure for some portions of your cluttered RDBMS, give PyTables a try. It
works well for storing data from data acquisition systems, sensors in geosciences,
simulation software, network data monitoring systems or as a centralized
repository for system logs, to name only a few possible uses.
However, it's important to emphasize the fact that PyTables is not designed to
work as a relational database competitor, but rather as a teammate. For example,
if you have very large tables in your existing relational database, then you can
move those tables to PyTables so as to reduce the burden of your existing
database while efficiently keeping those huge tables on-disk.

Why use Blosc?
- Multi-threaded compressor that can transmit data from caches to memory, and
back
- Its speed can exceed that of an OS memcpy()
Why use Blaze?
Because Blaze is a query system that looks like NumPy/Pandas. You write Blaze
queries, Blaze translates those queries to something else (like SQL), and ships
those queries to various databases to run on other people's fast code. It smooths
out this process to make interacting with foreign data as accessible as using
Pandas. This is actually quite difficult.

Ehsan Derakhshan
Ehsan.Derakhshan@innfinision.net
innfinision.net
Thank you
