Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
Modern applications, often described as "big-data" analysis, require us to manage immense amounts of data quickly. To deal with such applications, a new software stack has evolved.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and describe how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Introduction to the Hadoop Ecosystem (FrOSCon Edition) - Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
A comprehensive overview of the entire Hadoop operations and tools landscape: cluster management, coordination, ingestion, streaming, formats, storage, resources, processing, workflow, analysis, search and visualization
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) - Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Top Hadoop Big Data Interview Questions and Answers for Freshers - JanBask Training
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed... - Uwe Printz
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
I have studied Big Data analysis and found Hadoop to be the best-known and most popular technology for its distributed data-processing approach. I have gathered information about the various Hadoop distributions available in the market and tried to describe the most important tools and their functionality in the Hadoop ecosystem in this slideshow. I have also discussed connectivity with the R language from a data analysis and visualization perspective. I hope you enjoy it!
This is part of an introductory course on Big Data Tools for Artificial Intelligence. These slides introduce students to Apache Hadoop, DFS, and MapReduce.
To transform your organization and unlock the value of your data, you need a way to ingest, store and analyze every type of data in your organization.
This presentation covers the Data Access Layer of the Hadoop Ecosystem which enables you to achieve this.
We will use the HDP (Hortonworks Data Platform) reference architecture to walk through the Hadoop core and its ecosystem with focus on the data access layer.
We will cover some of the prominent tools of the ecosystem such as Pig, Hive, Sqoop, Flume and Oozie and how they are used for ingesting data into Hadoop from structured, unstructured and streaming sources.
Talk to us at +91 80 6567 9700 or send an email to training@springpeople.com for more information.
Hadoop Cluster Configuration and Data Loading - Module 2 - Rohit Agrawal
Learning Objectives - In this module, you will learn the Hadoop Cluster Architecture and Setup, Important Configuration files in a Hadoop Cluster, Data Loading Techniques.
This presentation accompanied a practical demonstration of Amazon's Elastic Computing services to CNET students at the University of Plymouth on 16/03/2010.
The practical demonstration involved an obviously parallel problem split on 5 Medium size AMIs. The problem was the calculation of the Clustering Coefficient and the Mean Path Length (Based on the original work done by Watts and Strogatz) for large networks. The code was written in Python taking advantage of the scipy, pyparallel and networkx toolkits
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK - huguk
This session will give you an update on what SUSE is up to in the Big Data arena. We will take a brief look at SUSE Linux Enterprise Server and why it makes the perfect foundation for your Hadoop Deployment.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms - DataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
This presentation provides information about Hadoop: what Hadoop is, how it overcomes the disadvantages of distributed systems, and an example MapReduce program.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013) - Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) avoids duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can then be calculated easily. This can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
3. Agenda
Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
4. LatentView Analytics
- Developed solutions for 25 Fortune 500 firms
- Over 500 people strong
- More than 1,000 years of combined experience in analytics
- Engaging 30K followers on social media
LatentView in the News
LatentView won the Deloitte Technology Fast 50 India awards for 6 consecutive years (2009-13)
'Top Innovator' awarded to LatentView by Developer Week (Conference & Festival 2013)
LatentView was a Top Finalist in the 'We Love Our Workplace 2013' category, reflecting global recognition of our workplace culture
LatentView is an Advanced Consulting Partner with Amazon Web Services
LatentView is an Alliance Partner with Tableau
Services provided by LatentView
- Build Reporting and Analytics Centers of Excellence (COEs)
- Analyze business problems both qualitatively & quantitatively and provide actionable insights
- Onsite-offshore global delivery model that helps in-house teams do more with less
- Provide thought leadership in Data Science
5. Work @ LatentView
Industry-specific analysis: market basket analysis, campaign analytics, fraud detection, survey analytics, customer lifetime value, demand forecasting, price optimization, social media analysis
Different data sources & formats: mobile, PC, tablet, signal & wireless data, servers & cloud, social, user profiles, surveys & reviews, travel & location, performance, system logs & database data, unstructured data
Technology & predictive analysis toolkits: data engineering & advanced analytics, infrastructure, databases, predictive modelling, CXO dashboards & visualization
6. Agenda
1 Introducing LatentView Analytics
Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
8. Data Processing Frameworks
Distributed Processing Characteristics
• Master/slave or peer-to-peer architecture
• Data replication and redundancy
• Fault tolerant, shared memory
• Centralized job distribution
• Efficient job scheduling
• Coordinated resource management
• Processes structured & semi-structured data
• Examples: Hadoop, Spark, Storm
Parallel Processing Characteristics
• Shared-nothing, massively parallel architecture
• Common or independent storage
• Independent memory & processor space
• Random job distribution
• Self-managed resources & workers
• Dynamic load-balanced cluster
• Processes unstructured data
• Examples: Banyan
9. The Search Context
• Gerard Salton, Father of Modern Search Technology
• Salton’s Magic Automatic Retriever of Text
• Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values
SMART (informational retrieval system)
Project Xanadu
ARPANet
Archie Query Form
FTP & WWW
Ask, AltaVista, Yahoo, Google, Bing
• Ted Nelson – Coined the Term Hyper Text
• Create a Computer Network with a simple UI to solve social problems like attribution
• Inspired creation of WWW
• Advanced Research Projects Agency Network
• Led to Internet
• First Implementation of TCP/IP stack
• Document search & find tool
• Script-based data gatherer with a regular expression matcher for retrieving files
• A database of web filenames which it would match with users' queries
• Enter Tim Berners-Lee
• httpd, TCP, DNS – connected it all
What is the biggest problem that the search engines of the last two decades solve?
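The Term Frequency (TF) and Inverse Document Frequency (IDF) weights mentioned above can be sketched in a few lines of Python. This is a minimal illustration on made-up toy documents, not Salton's SMART system itself:

```python
import math

# Toy corpus; each document is a list of terms (hypothetical data).
docs = [
    ["hadoop", "stores", "data"],
    ["spark", "processes", "data", "fast"],
    ["hadoop", "and", "spark", "process", "data"],
]

def tf(term, doc):
    # Term frequency: how often the term appears in this document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms in fewer documents score higher.
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "data" appears in every document, so its IDF (and TF-IDF) is 0;
# "hadoop" discriminates the first document, so it scores higher.
print(tf_idf("data", docs[0], docs))    # 0.0
print(tf_idf("hadoop", docs[0], docs))  # > 0
```

This is exactly the "term discrimination" idea: words that occur everywhere carry no signal for ranking.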
10. Project Lucene was written by Doug Cutting in 1999, purely in Java.
It was written with the intention of helping to create an open-source web search engine.
Lucene is just an indexing and search library and does not contain crawling or HTML-parsing functionality.
Building Lucene
Ported Nutch algorithms to Hadoop
Yahoo! hires Doug Cutting!
Apache Hadoop comes into the picture to support MapReduce & HDFS
Yahoo!'s Grid team adopts Hadoop
Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours
Algorithms in Hadoop
Yahoo! set up a Hadoop research cluster of 300 nodes; the sort benchmark was also run on 500 nodes in 42 hours (better hardware than the April benchmark)
Research cluster upgraded to 600 nodes
In 2008, won the 1-terabyte sort benchmark in 209 seconds on 900 nodes
Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
Hadoop Benchmarks!
As of 2008, loading 10 terabytes of data per day onto research clusters
17 clusters with a total of 24,000 nodes
In 2009, won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 TB sort in 173 minutes (on 3,400 nodes)
Last.fm, Facebook, New York Times
Hadoop In Action!
Lucene was not able to crawl or parse HTML by itself, so a sub-project called Nutch was developed under it
Doug Cutting & Mike Cafarella
Highly modular architecture allows developers to create plug-ins for media-type parsing, data retrieval, querying and clustering
Building Nutch
The Google File System paper was presented
NDFS was developed based on the paper
Google released another paper, on MapReduce, that revolutionized Hadoop development
MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This is known as Data Locality.
The Heart of Hadoop – Distributed File System & Map Reduce
A Brief History of Hadoop – Key Challenges Addressed
Year 1999: Efficient indexing of the results for easy retrieval
Year 2002: Efficient crawling of the World Wide Web at scale
Year 2003: File system
Year 2005: Algorithms in Hadoop
Year 2006: Cost- & time-efficient hardware and software
Year 2008: Hadoop in Action! Distributed processing, data warehousing & analysis
Hadoop Services Ecosystem
- Hadoop Distributions: Apache Hadoop, Apache Bigtop
- Hadoop as a Platform: Cloudera, Hortonworks, MapR
- Hadoop as a Service: GoGrid, Qubole, Altiscale, AWS EMR, Azure HDInsight, IBM BigInsights
- UI & Tools
Elephant though it is, Hadoop has its limitations!
- Master/slave architecture
- Batch vs real-time stream processing
- Relational vs NoSQL databases
- Data fragmentation and management
- Parallel-processing job requirements
- Efficient energy management
11. Applications of Big Data Processing
Science & Engineering: predicting galaxy types and shapes, analyzing life forms, weather forecasting
Environmental Management: traffic management, disaster recovery
Intelligent Devices & IoT: personal health care
12. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
13. Apache Hadoop & Family
Identify the Apache Hadoop Components!
14. The Apache Hadoop Stack
- Hadoop Distributed File System (HDFS)
- YARN / MapReduce v2
- Pig (scripting), Hive (SQL), Mahout (ML workflow), Oozie
- HBase (columnar data store)
- Sqoop (data exchange), Flume (log control)
- ZooKeeper (coordination)
- Hadoop User Experience (HUE)
15. Walking the Talk with Hadoop – Let’s Architect…
People you may know on LinkedIn.
You might know me, if people that you know, know me!
foreach u in UserList:
    foreach x in Connections(u):
        foreach y in Connections(x):
            if (y not in Connections(u)):
                Count(u, y)++;
    Sort (u, y) in descending order of Count(u, y);
    Choose top 3 y;
    Store (u, {y0, y1, y2..}) for serving;
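The pseudocode above can be sketched as plain single-machine Python. This is a toy in-memory version on hypothetical data (the graph and the function name are made up for illustration); a production version would run as a distributed MapReduce job:

```python
from collections import Counter

# Hypothetical connection graph: user -> set of direct connections.
connections = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "dave"},
    "carl": {"ann", "dave", "eve"},
    "dave": {"bob", "carl"},
    "eve": {"carl"},
}

def people_you_may_know(u, top_n=3):
    counts = Counter()
    for x in connections[u]:          # my connections
        for y in connections[x]:      # their connections
            if y != u and y not in connections[u]:
                counts[y] += 1        # number of mutual connections
    # Highest mutual-connection count first, then take the top N.
    return [y for y, _ in counts.most_common(top_n)]

print(people_you_may_know("ann"))  # ['dave', 'eve']
```

"dave" ranks first because ann and dave share two mutual connections (bob and carl), while "eve" shares only one (carl).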
16. Simplest ever Map-Reduce example
What is a Map? A Mapper is a function that transforms the input data into the required format, without aggregating.
Mapped_List = Mapper(Input_List)
Ex:
Input_List = (1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapper = Square()
Mapped_List = Square(1, 2, 3, 4, 5, 6, 7, 8, 9) = (1, 4, 9, 16, 25, 36, 49, 64, 81)
What is a Reduce? A Reducer is a function that aggregates the input data into the required format.
Output_List = Reducer(Mapped_List)
Ex:
Reducer = Sum()
Output_List = Sum(1, 4, 9, 16, 25, 36, 49, 64, 81) = 285
Characteristics of Map Reduce
- Map is an inherently parallel process: each list element is processed independently
- Reduce is inherently sequential, unless multiple lists are processed at a time in parallel
- Grouping is done to produce multiple lists to avail parallelism
Pipeline: Input → Partition → Map → Sort → Shuffle → Reduce → Output
Native MapReduce, Hadoop Streaming
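The Square/Sum example above maps directly onto Python's built-in map and reduce (a minimal single-machine sketch of the same idea):

```python
from functools import reduce

input_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Map: transform each element independently (inherently parallelizable).
mapped_list = list(map(lambda x: x * x, input_list))

# Reduce: aggregate the mapped values into a single result.
output = reduce(lambda a, b: a + b, mapped_list)

print(mapped_list)  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(output)       # 285
```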
17. Simulating Map Reduce
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
What does each of the above commands do? What is the output? (grep filters lines like a Map, sort groups the keys like the Shuffle, and uniq -c aggregates counts like a Reduce.)
Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)
Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)
mc:~$ cat /var/log/auth.log* | grep "session opened" | less
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
28321 root
86 ubuntu
47635 user
18. The MapReduce Process
Map: In (Key1, Value1) → Out List(Key2, Value2) (filtering, transformation)
Reduce: In List(Key2, List(Value2)) → Out List(Key3, Value3) (aggregation)
Shuffle: In (Key2, Value2) → Out Sort(Partition(Key2, List(Value2))) (movement / copy of data)
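The (Key, Value) flow above can be illustrated with a classic word count, with the map, shuffle, and reduce phases spelled out in single-process Python. This is a sketch of what Hadoop distributes across many nodes, with made-up input lines:

```python
from collections import defaultdict

lines = ["big data big hadoop", "hadoop big"]

# Map: each input line -> list of (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each key's list of values into a single count.
counts = {word: sum(values) for word, values in groups.items()}

print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Note how the shuffle is pure data movement: no values are changed, they are only regrouped so each reducer sees all values for one key.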
19. The MapReduce Process with a Deck of Cards!
Map in parallel → Shuffle/Group → Reduce: Sum() applied to each group
20. Hadoop Security
- Audit: centralized framework for collecting access-audit history and easy reporting on the data
- Authentication: Kerberos-based authentication; Kerberos can be connected to corporate LDAP environments to centrally provision user information
- Data Protection: supports encrypting data in transit and at rest, and masking capabilities for desensitizing PII
- Authorization: ensures users have access only to data permitted by corporate policies; provides fine-grained authorization via file permissions in HDFS and resource-level access control for YARN & MapReduce
- Centralized Security Administration: security requirements consistently applied across the platform and managed centrally with a single interface
Difference between Authentication & Authorization?
21. A Brief Note on Spark & Storm
What do you think is the most time-consuming aspect of Hadoop processes?
Spark: how to improve the I/O limitation? Result: faster analytics
Storm: how to achieve event-driven real-time analytics? Result: highly customized service response
23. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
24. Data Deluge – A big problem
PC, tablet, mobile, social, search & mail, e-commerce
25. Tree-based Unstructured Feature Extraction
Inputs: panel & web logs; social data (tweets, comments, likes, shares, blogs, reviews, clickstream, HTML, images, audio*, video*)
Pipeline: Data Parser → Rules Engine → Tree-Based Parser
Example extracted features:
Feature | Type | Detail
Feature 1 | Image | 600*400
Feature 2 | Link | #
Feature 3 | Price | 200$
Feature 4 | Star | 3.5
Tweet | Time | View
Tweet 1 | 12:00 | Positive
Tweet 2 | 12:05 | Neutral
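A tree-based parser of the kind sketched above can be illustrated with Python's standard html.parser. This is a toy rules engine run on a hypothetical product-page snippet, not LatentView's actual implementation:

```python
from html.parser import HTMLParser

class FeatureExtractor(HTMLParser):
    """Walk HTML tags and collect simple (type, detail) features."""

    def __init__(self):
        super().__init__()
        self.features = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Rule: record image dimensions as width*height.
            self.features.append(("Image", f"{attrs.get('width')}*{attrs.get('height')}"))
        elif tag == "a":
            # Rule: record link targets.
            self.features.append(("Link", attrs.get("href")))

# Hypothetical snippet standing in for scraped clickstream/HTML data.
html = '<div><img width="600" height="400"><a href="#">Buy for 200$</a></div>'
parser = FeatureExtractor()
parser.feed(html)
print(parser.features)  # [('Image', '600*400'), ('Link', '#')]
```

Real rules engines add many more tag- and text-level rules (prices, star ratings, sentiment), but the tree-walk structure is the same.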
26. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
Banyan – A Parallel Processing Framework
6 Demo
27. Banyan – Parallel processing framework at scale
• Is your data unstructured? Ex: HTML, images, URLs, audio, video, documents, text
• Is processing each input independent of processing other inputs? Ex: compressing one image is independent of the next image
• Do you need to solve the two problems above at web scale? Ex: say 1 million documents to be processed in less than 1 hour
We handle what Hadoop can't handle! Rather, we handle what Hadoop isn't supposed to handle – parallel processing & unstructured data!
Banyan is a parallel processing framework well integrated with the cloud platform of your choice.
Follow us here: Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group
http://www.growbanyan.com
Email us: runparallel@latentview.com
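The embarrassingly parallel pattern described above can be sketched with Python's standard multiprocessing module. This only illustrates the idea (independent inputs, no coordination between workers); it is not Banyan's implementation, and the worker function and inputs are made up:

```python
from multiprocessing import Pool

def process_document(doc):
    # Stand-in for real per-document work (parsing, compressing,
    # feature extraction); no worker depends on any other worker.
    return doc.upper()

if __name__ == "__main__":
    documents = [f"document {i}" for i in range(8)]  # hypothetical inputs
    # Each document is handed to whichever worker is free; because the
    # tasks share nothing, adding workers scales throughput almost linearly.
    with Pool(processes=4) as pool:
        results = pool.map(process_document, documents)
    print(results[:2])  # ['DOCUMENT 0', 'DOCUMENT 1']
```

Pool.map preserves input order, so results line up with documents even though workers finish at different times.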
28. Banyan vs Hadoop (yes-or-no comparison)
Job type: Banyan – embarrassingly parallel processing; Hadoop – distributed processing
Characteristics compared (yes/no for each framework):
- Master/slave architecture
- Shared-nothing architecture
- Data replication
- Fault tolerance
- Coordinated job distribution
- Dynamic load balancing
- Rescheduling job failures
- Processes structured data
- Processes unstructured data
Note: the core advantage of Banyan is best utilized when data processing & analysis (aggregation) are executed in a decoupled fashion, for jobs that can be processed in parallel