Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
Modern applications, often described as "big-data" analysis, require us to manage immense amounts of data quickly. To deal with such applications, a new software stack has evolved.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and describe how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Introduction to the Hadoop Ecosystem (FrOSCon Edition) - Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
A comprehensive overview of the entire Hadoop operations and tools landscape: cluster management, coordination, ingestion, streaming, formats, storage, resources, processing, workflow, analysis, search and visualization
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) - Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Top Hadoop Big Data Interview Questions and Answers for Freshers - JanBask Training
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed... - Uwe Printz
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
I have studied Big Data analysis and found Hadoop to be the best-known and most popular technology for its distributed data-processing approach. I have gathered information about the various Hadoop distributions available in the market and tried to describe the most important tools and their functionality in the Hadoop ecosystem in this slideshow. I have also discussed connectivity with the R language from a data analysis and visualization perspective. I hope you enjoy it!
This is part of an introductory course on Big Data Tools for Artificial Intelligence. These slides introduce students to Apache Hadoop, DFS, and MapReduce.
To transform your organization and unlock the value of your data, you need a way to ingest, store and analyze every type of data in your organization.
This presentation covers the Data Access Layer of the Hadoop Ecosystem which enables you to achieve this.
We will use the HDP (Hortonworks Data Platform) reference architecture to walk through the Hadoop core and its ecosystem with focus on the data access layer.
We will cover some of the prominent tools of the ecosystem such as Pig, Hive, Sqoop, Flume and Oozie and how they are used for ingesting data into Hadoop from structured, unstructured and streaming sources.
Talk to us at +91 80 6567 9700 or send an email to training@springpeople.com for more information.
Hadoop Cluster Configuration and Data Loading - Module 2 - Rohit Agrawal
Learning Objectives - In this module, you will learn the Hadoop Cluster Architecture and Setup, Important Configuration files in a Hadoop Cluster, Data Loading Techniques.
This presentation accompanied a practical demonstration of Amazon's Elastic Computing services to CNET students at the University of Plymouth on 16/03/2010.
The practical demonstration involved an obviously parallel problem split on 5 Medium size AMIs. The problem was the calculation of the Clustering Coefficient and the Mean Path Length (Based on the original work done by Watts and Strogatz) for large networks. The code was written in Python taking advantage of the scipy, pyparallel and networkx toolkits
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK - huguk
This session will give you an update on what SUSE is up to in the Big Data arena. We will take a brief look at SUSE Linux Enterprise Server and why it makes the perfect foundation for your Hadoop Deployment.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms - DataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
This presentation provides information about Hadoop: what Hadoop is, how it overcomes the disadvantages of distributed systems, and an example MapReduce program.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013) - Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) avoids duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can then be calculated easily. This can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
3. Agenda
Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
4. LatentView Analytics
- Developed solutions for 25 Fortune 500 firms
- Over 500 people strong
- More than 1,000 years of combined experience in analytics
- Engaging 30K followers on social media
LatentView in the News
LatentView won the Deloitte Technology Fast 50 India awards for 6 consecutive years (2009-13)
'Top Innovator' awarded to LatentView by Developer Week (Conference & Festival 2013)
LatentView was a Top Finalist in the 'We Love Our Workplace 2013' category, reflecting global recognition of our workplace culture
LatentView is an Advanced Consulting Partner with Amazon Web Services
LatentView is an Alliance Partner with Tableau
Services provided by LatentView
- Build Reporting and Analytics Centers of Excellence (COEs)
- Analyze business problems both qualitatively & quantitatively and provide actionable insights
- Onsite-offshore global delivery model that helps in-house teams do more with less
- Provide thought leadership in Data Science
5. Work @ LatentView
Industry-specific analysis: market basket analysis, campaign analytics, fraud detection, survey analytics, customer lifetime value, demand forecasting, price optimization, social media analysis
Different data sources & formats: mobile, PC, tablet, signal & wireless data, servers & cloud, social, user profiles, surveys & reviews, travel & location, performance, system logs & database data, unstructured data
Technology & predictive analysis toolkits: data engineering & advanced analytics, infrastructure, databases, predictive modelling, CXO dashboards & visualization
6. Agenda
1 Introducing LatentView Analytics
Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
8. Data Processing Frameworks
Distributed Processing Characteristics
• Master/slave or peer-to-peer architecture
• Data replication and redundancy
• Fault tolerant, shared memory
• Centralized job distribution
• Efficient job scheduling
• Coordinated resource management
• Processes structured & semi-structured data
• Examples: Hadoop, Spark, Storm
Parallel Processing Characteristics
• Shared-nothing, massively parallel architecture
• Common or independent storage
• Independent memory & processor space
• Random job distribution
• Self-managed resources & workers
• Dynamic load-balanced cluster
• Processes unstructured data
• Examples: Banyan
9. The Search Context
• Gerard Salton, Father of Modern Search Technology
• Salton’s Magic Automatic Retriever of Text
• Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values
SMART (informational retrieval system)
Project Xanadu
ARPANet
Archie Query Form
FTP & WWW
Ask, AltaVista, Yahoo, Google, Bing
• Ted Nelson – Coined the Term Hyper Text
• Create a Computer Network with a simple UI to solve social problems like attribution
• Inspired creation of WWW
• Advanced Research Projects Agency Network
• Led to Internet
• First Implementation of TCP/IP stack
• Document search & find tool
• Script-based data gatherer with a regular expression matcher for retrieving files
• A database of web filenames which it would match with users' queries
• Enter Tim Berners-Lee
• httpd, TCP, DNS – connected it all
What is the biggest problem that the search engines of the last two decades solve?
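The Term Frequency (TF) and Inverse Document Frequency (IDF) weights mentioned above can be sketched in a few lines of Python. This is a minimal illustration on made-up toy documents, not Salton's SMART system itself:

```python
import math

# Toy corpus; each document is a list of terms (hypothetical data).
docs = [
    ["hadoop", "stores", "data"],
    ["spark", "processes", "data", "fast"],
    ["hadoop", "and", "spark", "process", "data"],
]

def tf(term, doc):
    # Term frequency: how often the term appears in this document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms in fewer documents score higher.
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "data" appears in every document, so its IDF (and TF-IDF) is 0;
# "hadoop" discriminates the first document, so it scores higher.
print(tf_idf("data", docs[0], docs))    # 0.0
print(tf_idf("hadoop", docs[0], docs))  # > 0
```

This is exactly the "term discrimination" idea: words that occur everywhere carry no signal for ranking.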
10. Project Lucene was written by Doug Cutting in 1999, purely in Java.
It was written with the intention of helping to create an open-source web search engine.
Lucene is just an indexing and search library and does not contain crawling or HTML-parsing functionality.
Building Lucene
Ported Nutch algorithms to Hadoop
Yahoo! hires Doug Cutting!
Apache Hadoop comes into the picture to support MapReduce & HDFS
Yahoo!'s Grid team adopts Hadoop
Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours
Algorithms in Hadoop
Yahoo! set up a Hadoop research cluster of 300 nodes; the sort benchmark was also run on 500 nodes in 42 hours (better hardware than the April benchmark)
Research cluster upgraded to 600 nodes
In 2008, won the 1-terabyte sort benchmark in 209 seconds on 900 nodes
Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
Hadoop Benchmarks!
As of 2008, loading 10 terabytes of data per day onto research clusters
17 clusters with a total of 24,000 nodes
In 2009, won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 TB sort in 173 minutes (on 3,400 nodes)
Last.fm, Facebook, New York Times
Hadoop In Action!
Lucene was not able to crawl or parse HTML by itself, so a sub-project called Nutch was developed under it
Doug Cutting & Mike Cafarella
Highly modular architecture allows developers to create plug-ins for media-type parsing, data retrieval, querying and clustering
Building Nutch
The Google File System paper was presented
NDFS was developed based on the paper
Google released another paper, on MapReduce, that revolutionized Hadoop development
MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This is known as Data Locality.
The Heart of Hadoop – Distributed File System & Map Reduce
A Brief History of Hadoop – Key Challenges Addressed
Year 1999: Efficient indexing of the results for easy retrieval
Year 2002: Efficient crawling of the World Wide Web at scale
Year 2003: File system
Year 2005: Algorithms in Hadoop
Year 2006: Cost- & time-efficient hardware and software
Year 2008: Hadoop in Action! Distributed processing, data warehousing & analysis
Hadoop Services Ecosystem
- Hadoop Distributions: Apache Hadoop, Apache Bigtop
- Hadoop as a Platform: Cloudera, Hortonworks, MapR
- Hadoop as a Service: GoGrid, Qubole, Altiscale, AWS EMR, Azure HDInsight, IBM BigInsights
- UI & Tools
Elephant though it is, Hadoop has its limitations!
- Master/slave architecture
- Batch vs real-time stream processing
- Relational vs NoSQL databases
- Data fragmentation and management
- Parallel-processing job requirements
- Efficient energy management
11. Applications of Big Data Processing
Science & Engineering: predicting galaxy types and shapes, analyzing life forms, weather forecasting
Environmental Management: traffic management, disaster recovery
Intelligent Devices & IoT: personal health care
12. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
13. Apache Hadoop & Family
Identify the Apache Hadoop Components!
14. The Apache Hadoop Stack
- Hadoop Distributed File System (HDFS)
- YARN / MapReduce v2
- Pig (scripting), Hive (SQL), Mahout (ML workflow), Oozie
- HBase (columnar data store)
- Sqoop (data exchange), Flume (log control)
- ZooKeeper (coordination)
- Hadoop User Experience (HUE)
15. Walking the Talk with Hadoop – Let’s Architect…
People you may know on LinkedIn.
You might know me, if people that you know, know me!
foreach u in UserList:
    foreach x in Connections(u):
        foreach y in Connections(x):
            if (y not in Connections(u)):
                Count(u, y)++;
    Sort (u, y) in descending order of Count(u, y);
    Choose top 3 y;
    Store (u, {y0, y1, y2..}) for serving;
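The pseudocode above can be sketched as plain single-machine Python. This is a toy in-memory version on hypothetical data (the graph and the function name are made up for illustration); a production version would run as a distributed MapReduce job:

```python
from collections import Counter

# Hypothetical connection graph: user -> set of direct connections.
connections = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "dave"},
    "carl": {"ann", "dave", "eve"},
    "dave": {"bob", "carl"},
    "eve": {"carl"},
}

def people_you_may_know(u, top_n=3):
    counts = Counter()
    for x in connections[u]:          # my connections
        for y in connections[x]:      # their connections
            if y != u and y not in connections[u]:
                counts[y] += 1        # number of mutual connections
    # Highest mutual-connection count first, then take the top N.
    return [y for y, _ in counts.most_common(top_n)]

print(people_you_may_know("ann"))  # ['dave', 'eve']
```

"dave" ranks first because ann and dave share two mutual connections (bob and carl), while "eve" shares only one (carl).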
16. Simplest ever Map-Reduce example
What is a Map? A Mapper is a function that transforms the input data into the required format, without aggregating.
Mapped_List = Mapper(Input_List)
Ex:
Input_List = (1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapper = Square()
Mapped_List = Square(1, 2, 3, 4, 5, 6, 7, 8, 9) = (1, 4, 9, 16, 25, 36, 49, 64, 81)
What is a Reduce? A Reducer is a function that aggregates the input data into the required format.
Output_List = Reducer(Mapped_List)
Ex:
Reducer = Sum()
Output_List = Sum(1, 4, 9, 16, 25, 36, 49, 64, 81) = 285
Characteristics of Map Reduce
- Map is an inherently parallel process: each list element is processed independently
- Reduce is inherently sequential, unless multiple lists are processed at a time in parallel
- Grouping is done to produce multiple lists to avail parallelism
Pipeline: Input → Partition → Map → Sort → Shuffle → Reduce → Output
Native MapReduce, Hadoop Streaming
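The Square/Sum example above maps directly onto Python's built-in map and reduce (a minimal single-machine sketch of the same idea):

```python
from functools import reduce

input_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Map: transform each element independently (inherently parallelizable).
mapped_list = list(map(lambda x: x * x, input_list))

# Reduce: aggregate the mapped values into a single result.
output = reduce(lambda a, b: a + b, mapped_list)

print(mapped_list)  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(output)       # 285
```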
17. Simulating Map Reduce
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
What does each of the above commands do? What is the output? (grep filters lines like a Map, sort groups the keys like the Shuffle, and uniq -c aggregates counts like a Reduce.)
Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)
Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)
mc:~$ cat /var/log/auth.log* | grep "session opened" | less
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
28321 root
86 ubuntu
47635 user
18. The MapReduce Process
Map: In (Key1, Value1) → Out List(Key2, Value2) (filtering, transformation)
Reduce: In List(Key2, List(Value2)) → Out List(Key3, Value3) (aggregation)
Shuffle: In (Key2, Value2) → Out Sort(Partition(Key2, List(Value2))) (movement / copy of data)
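The (Key, Value) flow above can be illustrated with a classic word count, with the map, shuffle, and reduce phases spelled out in single-process Python. This is a sketch of what Hadoop distributes across many nodes, with made-up input lines:

```python
from collections import defaultdict

lines = ["big data big hadoop", "hadoop big"]

# Map: each input line -> list of (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each key's list of values into a single count.
counts = {word: sum(values) for word, values in groups.items()}

print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Note how the shuffle is pure data movement: no values are changed, they are only regrouped so each reducer sees all values for one key.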
19. The MapReduce Process with a Deck of Cards!
Map in parallel → Shuffle/Group → Reduce: Sum() applied to each group
20. Hadoop Security
- Audit: centralized framework for collecting access-audit history and easy reporting on the data
- Authentication: Kerberos-based authentication; Kerberos can be connected to corporate LDAP environments to centrally provision user information
- Data Protection: supports encrypting data in transit and at rest, and masking capabilities for desensitizing PII
- Authorization: ensures users have access only to data permitted by corporate policies; provides fine-grained authorization via file permissions in HDFS and resource-level access control for YARN & MapReduce
- Centralized Security Administration: security requirements consistently applied across the platform and managed centrally with a single interface
Difference between Authentication & Authorization?
21. A Brief Note on Spark & Storm
What do you think is the most time-consuming aspect of Hadoop processes?
Spark: how to improve the I/O limitation? Result: faster analytics
Storm: how to achieve event-driven real-time analytics? Result: highly customized service response
23. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
24. Data Deluge – A big problem
PC, tablet, mobile, social, search & mail, e-commerce
25. Tree-based Unstructured Feature Extraction
Inputs: panel & web logs; social data (tweets, comments, likes, shares, blogs, reviews, clickstream, HTML, images, audio*, video*)
Pipeline: Data Parser → Rules Engine → Tree-Based Parser
Example extracted features:
Feature | Type | Detail
Feature 1 | Image | 600*400
Feature 2 | Link | #
Feature 3 | Price | 200$
Feature 4 | Star | 3.5
Tweet | Time | View
Tweet 1 | 12:00 | Positive
Tweet 2 | 12:05 | Neutral
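A tree-based parser of the kind sketched above can be illustrated with Python's standard html.parser. This is a toy rules engine run on a hypothetical product-page snippet, not LatentView's actual implementation:

```python
from html.parser import HTMLParser

class FeatureExtractor(HTMLParser):
    """Walk HTML tags and collect simple (type, detail) features."""

    def __init__(self):
        super().__init__()
        self.features = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Rule: record image dimensions as width*height.
            self.features.append(("Image", f"{attrs.get('width')}*{attrs.get('height')}"))
        elif tag == "a":
            # Rule: record link targets.
            self.features.append(("Link", attrs.get("href")))

# Hypothetical snippet standing in for scraped clickstream/HTML data.
html = '<div><img width="600" height="400"><a href="#">Buy for 200$</a></div>'
parser = FeatureExtractor()
parser.feed(html)
print(parser.features)  # [('Image', '600*400'), ('Link', '#')]
```

Real rules engines add many more tag- and text-level rules (prices, star ratings, sentiment), but the tree-walk structure is the same.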
26. Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
Banyan – A Parallel Processing Framework
6 Demo
27. Banyan – Parallel processing framework at scale
• Is your data unstructured? Ex: HTML, images, URLs, audio, video, documents, text
• Is processing each input independent of processing other inputs? Ex: compressing one image is independent of the next image
• Do you need to solve the two problems above at web scale? Ex: say 1 million documents to be processed in less than 1 hour
We handle what Hadoop can't handle! Rather, we handle what Hadoop isn't supposed to handle – parallel processing & unstructured data!
Banyan is a parallel processing framework well integrated with the cloud platform of your choice.
Follow us here: Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group
http://www.growbanyan.com
Email us: runparallel@latentview.com
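The embarrassingly parallel pattern described above can be sketched with Python's standard multiprocessing module. This only illustrates the idea (independent inputs, no coordination between workers); it is not Banyan's implementation, and the worker function and inputs are made up:

```python
from multiprocessing import Pool

def process_document(doc):
    # Stand-in for real per-document work (parsing, compressing,
    # feature extraction); no worker depends on any other worker.
    return doc.upper()

if __name__ == "__main__":
    documents = [f"document {i}" for i in range(8)]  # hypothetical inputs
    # Each document is handed to whichever worker is free; because the
    # tasks share nothing, adding workers scales throughput almost linearly.
    with Pool(processes=4) as pool:
        results = pool.map(process_document, documents)
    print(results[:2])  # ['DOCUMENT 0', 'DOCUMENT 1']
```

Pool.map preserves input order, so results line up with documents even though workers finish at different times.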
28. Banyan vs Hadoop (yes-or-no comparison)
Job type: Banyan – embarrassingly parallel processing; Hadoop – distributed processing
Characteristics compared (yes/no for each framework):
- Master/slave architecture
- Shared-nothing architecture
- Data replication
- Fault tolerance
- Coordinated job distribution
- Dynamic load balancing
- Rescheduling job failures
- Processes structured data
- Processes unstructured data
Note: the core advantage of Banyan is best utilized when data processing & analysis (aggregation) are executed in a decoupled fashion, for jobs that can be processed in parallel