This document provides an overview of Hadoop and several related big data technologies. It begins by defining the challenges of big data as the 3Vs: volume, velocity, and variety. It then explains why traditional databases cannot handle this type and scale of unstructured data. The document goes on to describe how Hadoop works, using HDFS for storage and MapReduce as the programming model. It also summarizes several Hadoop ecosystem projects, including YARN, Hive, Pig, HBase, ZooKeeper, and Spark, that help to process and analyze large datasets.
AN OVERVIEW OF BIG DATA AND HADOOP: THE ARCHITECTURE IT USES AND THE WAY IT WORKS ON DATA SETS. THE SLIDES ALSO SHOW THE VARIOUS FIELDS WHERE THEY ARE MOST USED AND IMPLEMENTED.
A short overview of Big Data, along with its popularity and its ups and downs from past to present. We also look at its needs, challenges, and risks, the architectures involved in it, and the vendors associated with it.
The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.
View the Big Data Technology Stack in a nutshell. This Big Data Technology Stack deck covers the different layers of the Big Data world and summarizes the major technologies in vogue today.
A presentation on big data,
from the workshop "The Era of Big Data: Why and How?" at the 22nd Computer Society of Iran Conference, csicc2017.ir
Vahid Amiri
vahidamiry.ir
datastack.ir
Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) and struggles with unstructured data and real-time analysis, the era of Big Data opens up a new technological period, offering advanced architectures and infrastructures that allow sophisticated analyses taking into account these new data sources integrated into the business ecosystem. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.
Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types (structured and unstructured, streaming and batch) and different sizes, from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. It has one or more of the following characteristics: high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at a very large scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources, independently or together with their existing enterprise data, to gain new insights resulting in significantly better and faster decisions.
Implementation of Multi-node Clusters in Column-Oriented Database using HDFS (IJEACS)
HBase is a NoSQL database that runs in the Hadoop environment, so it can be called the Hadoop database. It uses the Hadoop Distributed File System and MapReduce, implementing a key/value store for real-time data access that combines the deep capabilities and efficiency of MapReduce. Previous testing used single-node clustering, which improved query performance compared to SQL; even so, data retrieval remained complicated because there were no multi-node clusters and everything was based on SQL queries. In this paper, we use HBase, a column-oriented database that sits on top of HDFS (the Hadoop Distributed File System), along with multi-node clustering, which increases performance. HBase is a key/value store that is a consistent, distributed, multidimensional, sorted map. HBase stores data in cells, and those cells are grouped by a row key. Our proposal therefore yields better query performance and data retrieval than existing approaches.
Users can run queries via MicroStrategy's visual interface without needing to write unfamiliar HiveQL or MapReduce scripts. In essence, any user, without Hadoop programming skills, can ask questions against vast volumes of structured and unstructured data to gain valuable business insights.
Compare and contrast the big data processing platforms RDBMS, Hadoop, and Spark. The pros and cons of each platform are discussed, and business use cases are also included.
Big data analytics: Technology's bleeding edge (Bhavya Gulati)
There can be data without information, but there cannot be information without data.
Companies without big data analytics are deaf and dumb, mere wanderers on the web.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be calculated directly; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
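As a concrete illustration of the first of these optimizations, here is a small single-threaded Python sketch (not the STICD algorithm itself) that skips rank updates for vertices whose rank has already converged. The function name and tolerance are illustrative only, and freezing converged vertices is an approximation, since a later change in an in-neighbor would not be propagated:

```python
def pagerank_skip_converged(graph, damping=0.85, tol=1e-10, max_iter=100):
    """graph: dict vertex -> list of out-neighbors (no dangling nodes).
    Returns a dict of approximate PageRank values."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    # Precompute in-neighbors so each vertex can pull its incoming rank.
    in_nbrs = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_nbrs[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_ranks = dict(ranks)
        for v in graph:
            if v in converged:   # skip work for settled vertices
                continue
            incoming = sum(ranks[u] / len(graph[u]) for u in in_nbrs[v])
            new_ranks[v] = (1 - damping) / n + damping * incoming
            if abs(new_ranks[v] - ranks[v]) < tol:
                converged.add(v)
        ranks = new_ranks
        if len(converged) == n:  # every vertex settled: stop iterating
            break
    return ranks
```

On a small cycle graph the ranks settle immediately to 1/n each, and the converged set lets later iterations do no per-vertex work at all.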
2. AGENDA
Defining the problem – 3Vs
Why traditional storages don’t work
How does Hadoop work?
HDFS (Hadoop 1.0 Vs 2.0)
YARN (2.0)- Yet Another Resource Negotiator
Map Reduce
When we don’t know how to code
Hive (Overview)
PIG (Overview)
HBase (Overview)
Zookeeper (Overview)
Spark (Overview)
3. DEFINING THE PROBLEM – 3VS
Volume - Lots and lots of data
Datasets are so large and complex
Cannot use relational database
Challenges: capture, curation, storage, search, sharing,
transfer, analysis and visualization.
4. DEFINING THE PROBLEM – 3V (CONTD.)
Velocity - Huge amounts of data generated at
incredible speed
NYSE generates about 1 TB of new trade data per day
AT&T anonymized Call Detail Records (CDRs) top at
around 1 GB per hour.
Variety - Differently formatted data sets from
different sources
Twitter keeps track of tweets, Facebook produces
posts and likes data, YouTube streams videos
5. WHY TRADITIONAL STORAGES DON’T WORK
Unstructured data is exploding; not much of the
data produced is relational in nature.
No redundancy
High computational cost
Capacity limit for structured data (costly hardware)
Expensive License
Data type                                     | Nature
XML                                           | Semi-structured
Word docs, PDF files etc.                     | Unstructured
Email body                                    | Unstructured
Data from Enterprise Systems (ERP, CRM etc.)  | Structured
9. YARN (2.0)- YET ANOTHER RESOURCE
NEGOTIATOR
Computing framework for Hadoop.
YARN has Resource Manager-
Manages and allocates cluster resources
Improves performance and Quality of Service
10. MAP REDUCE
Programming model in Java
Work on large amounts of data
Provides redundancy & fault tolerance
Runs the code on each data node
11. MAP REDUCE (CONTD.)
Steps for Map Reduce:
Read in lots of data
Map: extract something you care about from each
record/line.
Shuffle and sort
Reduce: aggregate, summarize, filter or transform
Write results.
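The steps above can be sketched as a pure-Python simulation of the classic word-count example. In real Hadoop these phases run as distributed Java tasks across the cluster; this sketch, with illustrative function names, only models the data flow from map through shuffle/sort to reduce:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: extract something we care about from each record/line,
    # emitting (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle and sort: group all values by key, then sort the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts now maps each word to its total occurrence count
```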
13. HIVE (OVERVIEW)
Data warehouse infrastructure built on top of Hadoop
Compiles SQL queries into MapReduce jobs and runs them
in the cluster.
Brings structure to unstructured data
Key Building Principles:
Structured data with rich data types (structs, lists and maps)
Directly query data from different formats (text/binary) and file
formats (Flat/sequence).
SQL as a familiar programming tool and for standard analytics
Types of applications:
Summarization: Daily/weekly aggregations
Ad hoc analysis
Data Mining
Spam detection
Many more ….
14. PIG (OVERVIEW)
High level dataflow language
Has its own syntax (Preferable for people with
programming background)
Compiler that produces sequences of MapReduce
programs.
Structure is amenable to substantial parallelization.
Key properties of PIG:
Ease of programming: Trivial to achieve parallel execution of
simple and parallel data analysis tasks
Optimization opportunities: allows user to focus on semantics
rather than efficiency.
Extensibility: Users can create their own functions to do
special purpose processing.
15. HBASE (OVERVIEW)
HBase is a distributed column-oriented data store built
on top of HDFS.
Data is logically organized into tables, rows and
columns.
HDFS is good for batch processing (scan over big files).
Not good for record lookup.
Not good for incremental addition of small batches.
Not good for updates.
HBase is designed to efficiently address the above
points
Fast record lookup
Support for record level insertion
Support for updates (not in place).
Updates are done by creating new versions of values.
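A minimal in-memory sketch of this version-on-update idea follows. The class and method names are illustrative and do not reflect HBase's actual client API; the point is only that a put appends a new timestamped version of a cell instead of overwriting in place, and a read returns the newest version:

```python
import time

class VersionedStore:
    """Toy HBase-style cell store: values are versioned by timestamp."""

    def __init__(self, max_versions=3):
        self.cells = {}             # (row, column) -> list of (ts, value)
        self.max_versions = max_versions

    def put(self, row, column, value, ts=None):
        # "Update" by appending a new version, never overwriting in place.
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts if ts is not None else time.time(), value))
        versions.sort()                        # oldest first
        del versions[:-self.max_versions]      # retain only newest versions

    def get(self, row, column):
        # Reads return the newest version of the cell.
        versions = self.cells.get((row, column))
        return versions[-1][1] if versions else None
```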
16. ZOOKEEPER (OVERVIEW)
Zookeeper is a distributed, open source
coordination service for distributed applications.
Exposes simple set of primitives that distributed
applications can build upon to implement higher
level services for synchronization, configuration
maintenance, and groups and naming.
Coordination services are notoriously hard to get
right. They are prone to errors like race conditions
and deadlock.
The motivation behind zookeeper is to relieve
distributed applications the responsibility of
implementing coordination services from scratch.
17. SPARK (OVERVIEW)
Motivation: the MapReduce programming model transforms data
flowing from stable storage to stable storage (disk to disk).
Acyclic data flow is a powerful abstraction, but not efficient for
applications that repeatedly reuse a working set of data.
Iterative algorithms
Interactive data mining
Spark makes working sets a first-class concept to efficiently
support these applications.
Goal:
To provide distributed memory abstractions for clusters to support
apps with working sets.
Retain the attractive properties of map reduce.
Fault tolerance
Data locality
Scalability
Augment data flow model with “resilient distributed datasets”
(RDDs)
18. SPARK (OVERVIEW CONTD.)
Resilient distributed datasets (RDDs)
Immutable collections partitioned across cluster that can
be rebuilt if a partition is lost.
Created by transforming data in stable storage using
data flow operators (map, filter, group-by, ..)
Can be cached across parallel operations.
Parallel operations on RDDs.
Reduce, collect, count, save, …..
Restricted shared variables
Accumulators, broadcast variables.
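The RDD idea above can be sketched as a toy, single-process Python class: an immutable, partitioned collection that remembers how it was derived (its lineage), so that a lost partition can be recomputed from its parent rather than restored from a replica. The names are illustrative; this is not Spark's API:

```python
class ToyRDD:
    """Toy single-process model of an RDD: immutable partitions + lineage."""

    def __init__(self, partitions, lineage=None):
        self._partitions = [list(p) for p in partitions]
        self._lineage = lineage  # (parent, per-partition function), or None

    def map(self, f):
        # Transformation: produce a new RDD, recording its lineage.
        parts = [[f(x) for x in p] for p in self._partitions]
        return ToyRDD(parts, lineage=(self, lambda p: [f(x) for x in p]))

    def filter(self, pred):
        parts = [[x for x in p if pred(x)] for p in self._partitions]
        return ToyRDD(parts,
                      lineage=(self, lambda p: [x for x in p if pred(x)]))

    def collect(self):
        # Parallel operation (here, just flatten all partitions).
        return [x for p in self._partitions for x in p]

    def recompute_partition(self, i):
        # Fault tolerance: rebuild partition i from the parent via lineage,
        # instead of relying on a replicated copy.
        parent, fn = self._lineage
        return fn(parent._partitions[i])
```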
19. SPARK (OVERVIEW CONTD)
Fast map reduce like engine
Uses in memory cluster computing
Compatible with Hadoop storage API.
Has APIs in Scala, Java, and Python.
Useful for large datasets and iterative algorithms.
Up to 40x faster than MapReduce.
Support for:
Spark SQL : Hive on Spark
MLlib : Machine learning library
GraphX : Graph processing.
Founder: Doug Cutting. He named it after his son's toy elephant.
Active NameNode: In order to provide HDFS high availability, the architecture now has an active and a standby NameNode. The NameNode keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data itself, only reference metadata. Client applications talk to the NameNode whenever they wish to locate a file, or to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
It is essential to look after the NameNode. Here are some recommendations from production use:
Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
Use ECC RAM.
On Java6u15 or later, run the server VM with compressed pointers -XX:+UseCompressedOops to cut the JVM heap size down.
List more than one name node directory in the configuration, so that multiple copies of the file system meta-data will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the meta-data.
Configure the NameNode to store one set of transaction logs on a separate disk from the image.
Configure the NameNode to store another set of transaction logs to a network mounted disk.
Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
Do not host DataNode, JobTracker or TaskTracker services on the same system.
DataNodes: A datanode stores data in the HDFS. A functional file system has more than one Datanode, with data replicated across them. On startup, the datanode connects to the Namenode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode, talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host Datanode instances, so that MapReduce operations are performed close to the data.
DataNode instances can talk to each other, which is what they do when they are replicating data.
There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.
An ideal configuration is for a server to have a DataNode, a TaskTracker, and enough physical disks, with one TaskTracker slot per CPU. This allows every TaskTracker 100% of a CPU, and separate disks to read and write data.
Avoid using NFS for data storage in a production system.
Node Manager: The NM is YARN's per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource Manager (RM), overseeing containers' life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services which may be exploited by different YARN applications.
Resource Manager: The RM is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NMs and the per-application ApplicationMasters (AMs).
Application Masters: AMs are responsible for negotiating resources with the ResourceManager and for working with the Node Managers to start the containers.
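A highly simplified sketch of this division of labor follows, assuming toy node names and memory figures; real YARN scheduling (queues, locality, vcores, preemption) is far richer than this. It only shows the core idea that the RM grants containers against per-node capacity reported by the NodeManagers:

```python
class ResourceManager:
    """Toy RM: grants containers from per-node free memory."""

    def __init__(self, node_capacity_mb):
        # node name -> free memory in MB, as reported by the NodeManagers
        self.free_mb = dict(node_capacity_mb)

    def allocate(self, container_mb, count):
        """Grant up to `count` containers of `container_mb` MB each.

        Returns the list of nodes the containers were placed on, as an
        ApplicationMaster would receive before asking the NodeManagers
        to launch them.
        """
        granted = []
        for node, free in self.free_mb.items():
            while free >= container_mb and len(granted) < count:
                free -= container_mb
                granted.append(node)
            self.free_mb[node] = free
        return granted
```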
Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS).
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java.
The Hive Hadoop component is used for completely structured data, whereas the Pig Hadoop component is used for semi-structured data. Hive is mainly used for creating reports, whereas Pig is mainly used for programming.
Ambari: A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.