"Hadoop and NoSQL: Scalable Back-end Clusters Orchestration in Real-world Systems" was presented in CloudCon2012: BIT’s 1st Annual World Congress of Cloud Computing 2012 will be held from August 28-30, 2012 in Dalian, China
1. Hadoop and NoSQL: Scalable Back-end Clusters Orchestration in Real-world Systems
CloudCon 2012, Dalian, China
Ruo Ando
NICT: National Institute of Information and Communications Technology, Tokyo, Japan
2. Agenda: Scalable Back-end Clusters Orchestration for Real-world Systems (Large-scale Network Monitoring)
■ Hadoop and NoSQL: Scalable Back-end Clusters Orchestration in Real-world Systems
Hadoop and NoSQL are usually used together, partly because a key-value data format (such as JSON) is well suited to exchanging data between MongoDB and HDFS. These technologies are deployed in a network monitoring system and a large-scale testbed at a national research institute in Japan.
■ What is orchestration for? – Large-scale network monitoring
With the rapid growth of botnets and file-sharing networks, network traffic monitoring logs have become "big data". Today's large-scale network monitoring needs scalable clusters for traffic logging and data processing.
■ Background – Internet traffic explosion
Some statistics are shown about mobile phone traffic and the "gigabyte club".
■ Real-world systems – Large-scale DHT network crawling
To test the performance of our system, we crawled the DHT (BitTorrent) network. Our system obtained information on over 10,000,000 nodes in 24 hours. In addition, a ranking of countries by DHT network popularity is generated by our HDFS cluster.
■ Architecture overview
We use everything available for constructing high-speed, scalable clusters (hypervisor, NoSQL, HDFS, Scala, etc.).
■ MapReduce and traffic logs
For aggregating and sorting traffic logs, we programmed a two-stage MapReduce.
■ Results and demos
■ Conclusion
3. NICT: National Institute of Information and Communications Technology, Tokyo, Japan
Solar observatory
Large-scale testbeds: large-scale network emulation for analyzing cyber incidents (DDoS, botnets)
Darknet monitoring for malware analysis: we have over 140,000 passive monitors in the darknet for analyzing botnets
4. StarBED: A Large-Scale Network Experiment Environment at NICT
• Developers always desire to evaluate their new technologies in realistic situations, and developers for the Internet are no exception. The general experimental issues for Internet technologies are efficiency and scalability; StarBED enables such factors to be evaluated in realistic situations.
• Actual computers and network equipment are required to evaluate software for the real Internet. StarBED has many actual computers and the switches that connect them, so we can reproduce close-to-reality situations with the kind of equipment actually used on the Internet. Developers who want to evaluate a real implementation have to use actual equipment.
Node groups (group, nodes, disks, year introduced):
F: 168 nodes, 4x SATA, 2006
H: 240 nodes, 2x SATA, 2009
I: 192 nodes, 4x SATA, 2011
J: 96 nodes, 4x SATA, 2011
Other: about 500 nodes (roughly 960 in total; about 1,000 servers)
StarBED collaborates with other testbed projects such as DETER and PlanetLab in the US.
Groups I, J, K, L hardware: Cisco UCS C200 M2; CPU: 2x Intel 6-core Xeon X5670; memory: 48.0 GB; disk: 2x 500 GB SATA; network: dual on-board Gigabit Ethernet
5. Real-world systems: monitoring the BitTorrent network – handling massive DHT crawling
Invisibility (and thus being unstoppable) encourages illegal adoption of the DHT network.
In October 2010, a New York judge ordered LimeWire to shut down its file-sharing software: the US federal court judge found that LimeWire's service was used as software for infringement of copyrighted content. Soon after, a new version of LimeWire called LPE (LimeWire Pirate Edition) was released as a resurrection by anonymous creators.
Estimates of BitTorrent's share of all Internet traffic:
① "55%" – CableLabs: about half of the upstream traffic of CATV.
② "35%" – CacheLogic: "LIVEWIRE – File-sharing network thrives beneath the radar".
③ "60%" – documents at www.sans.edu: "It is estimated that more than 60% of the traffic on the internet is peer-to-peer."
6. Architecture Overview
The parser and translator are parallelized with Scala. Virtual machines and data nodes can be added to scale out.
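As a rough illustration of how the parsing stage might be parallelized, here is a minimal Scala sketch using parallel collections (built into the standard library up to Scala 2.12). The record layout and parseLine are hypothetical stand-ins, not the project's actual code.

```scala
// Hypothetical sketch: parallelizing traffic-log parsing with Scala
// parallel collections. parseLine and PeerRecord are illustrative.
object ParallelParser {
  case class PeerRecord(ip: String, port: Int, timestamp: String)

  // One traffic-log line looks like: "77.221.39.201,6881,2011/9/25 23:57:43,1"
  def parseLine(line: String): PeerRecord = {
    val cols = line.split(',')
    PeerRecord(cols(0), cols(1).toInt, cols(2))
  }

  def main(args: Array[String]): Unit = {
    val lines = scala.io.Source.fromFile(args(0)).getLines().toVector
    // .par spreads the map over a thread pool, one simple way to keep
    // all cores busy while translating logs before loading them into HDFS.
    val records = lines.par.map(parseLine).seq
    println(s"parsed ${records.size} records")
  }
}
```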
8. Demo: visualizing propagation of DHT crawling
We crawled more than 10,000,000 peers in the DHT network in 24 hours. An SQL database (MySQL or PostgreSQL) cannot handle 4,000,000 peers in 3 hours!
9. DHT crawler and MapReduce
For a huge DHT network, we cannot run too many crawlers. Without HDFS, it takes 7 days to process one day's worth of data.
Example output, the country ranking for one day (also shown on slide 17):
RANK Country # of nodes Region Domain
1 Russia 1,488,056 Russia RU
2 United states 1,177,766 North America US
3 China 815,934 East Asia CN
4 UK 414,282 West Europe GB
5 Canada 408,592 North America CA
6 Ukraine 399,054 East Europe UA
7 France 394,005 West Europe FR
8 India 309,008 South Asia IN
9 Taiwan 296,856 East Asia TW
10 Brazil 271,417 South America BR
11 Japan 262,678 East Asia JP
12 Romania 233,536 East Europe RO
13 Bulgaria 226,885 East Europe BG
14 South Korea 217,409 East Asia KR
15 Australia 216,250 Oceania AU
16 Poland 184,087 East Europe PL
17 Sweden 183,465 North Europe SE
18 Thailand 183,008 South East Asia TH
19 Italy 177,932 West Europe IT
20 Spain 172,969 West Europe ES
[Figure: DHT crawlers query the DHT network and dump results into a key-value store (<key> = node ID, <value> = data such as address and port); the dumped data then flows through Map, Shuffle, and Reduce stages. Scale out!]
The number of map jobs should be increased to match the number of DHT crawlers.
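To make "scale map jobs with the crawler count" concrete, here is a hedged sketch using the classic org.apache.hadoop.mapred API; the job name is made up for illustration, and setNumMapTasks is only a hint to Hadoop, since actual parallelism follows the input splits.

```scala
import org.apache.hadoop.mapred.JobConf

object JobSetup {
  // Sketch only: in practice one would also give each crawler its own
  // input file so the split count matches the requested map-task count.
  def configureJob(numCrawlers: Int): JobConf = {
    val conf = new JobConf()
    conf.setJobName("dht-crawl-aggregate") // hypothetical job name
    conf.setNumMapTasks(numCrawlers)       // hint: scale maps with crawlers
    conf
  }
}
```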
10. Scaling DHT crawlers out!
FIND_NODE is used to obtain the contact information for a node ID.
arguments: {"id" : "<querying nodes id>", "target" : "<id of target node>"}
response: {"id" : "<queried nodes id>", "nodes" : "<compact node info>"}
The response should contain a key "nodes" with either the compact node info for the target node or the K (8) closest nodes in the queried node's routing table.
[Figure: a hypervisor hosts multiple DHT crawlers querying the DHT network.]
Since the info for the key nodes and the K (8) closest nodes should be randomly distributed, we can obtain 8^N peers in the worst case.
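For readers unfamiliar with KRPC, the sketch below hand-assembles a bencoded find_node query along the lines of BEP 5. The transaction id and node ids are fake placeholders; real ids are arbitrary 20-byte strings, not the ASCII shown here.

```scala
// Minimal sketch of a KRPC find_node query, bencoded by hand.
object FindNode {
  // Bencoding: strings are "<len>:<bytes>"; dicts are "d...e" with
  // keys in sorted order.
  def str(s: String): String = s"${s.length}:$s"

  def dict(entries: (String, String)*): String =
    entries.sortBy(_._1)
      .map { case (k, v) => str(k) + v }
      .mkString("d", "", "e")

  def findNodeQuery(myId: String, targetId: String): String =
    dict(
      "t" -> str("aa"),              // transaction id, echoed in the reply
      "y" -> str("q"),               // this message is a query
      "q" -> str("find_node"),
      "a" -> dict("id" -> str(myId), "target" -> str(targetId))
    )

  def main(args: Array[String]): Unit = {
    // 20-character placeholder ids (real ids are arbitrary bytes)
    println(findNodeQuery("abcdefghij0123456789", "mnopqrstuvwxyz123456"))
  }
}
```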
11. Rapid propagation of the DHT gossip protocol: N^M nodes
[Chart: cumulative peers discovered over 26 hours, rising past 10,000,000; a second, log-scale chart shows the per-hour difference.]
Applying the gossip protocol, DHT crawling has N^M (N = 5-8) propagation speed. In the first 4 hours we can obtain more than 4,000,000 peers; after 5 hours, the Δ (increase) becomes stable.
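A quick back-of-the-envelope sketch of this fan-out, assuming a clean branching factor of K = 8 per hop (which ignores duplicate contacts, the reason the growth curve eventually flattens):

```scala
// Worst-case fan-out of the crawl: each FIND_NODE reply returns up to
// K = 8 new contacts, so hop n can reach on the order of K^n peers.
object FanOut {
  def main(args: Array[String]): Unit = {
    val k = 8
    (1 to 9).foreach { hop =>
      val peers = math.pow(k, hop).toLong
      println(f"hop $hop%2d -> up to $peers%,d new peers")
    }
    // 8^8 is already about 16.7 million, consistent with crossing
    // 10,000,000 observed peers within a day before duplicates dominate.
  }
}
```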
12. Visualization & ranking
77.221.39.201,6881,2011/9/25 23:57:43,1
87.97.210.128,62845,2011/9/25 23:56:32,1
188.40.33.212,6881,2011/9/25 23:33:58,1
188.232.9.21,49924,2011/9/25 23:37:02,1
Traffic logs
is parsed
Into XML
Location info is (Keyhole
IP address retrieved by GeoIP Time
Markup
from each IP address
Language)
Location Info
Domain name (country, city, latlng)
KML movie
Strings are tokenized Figure
and aggregated
ranking by HDFS
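A minimal sketch of the log-to-KML translation step. lookupLatLng is a hypothetical stub standing in for a GeoIP lookup (e.g. a MaxMind database), and the output shows only a bare Placemark rather than the full movie.

```scala
// Sketch: turn one parsed log row into a KML Placemark.
object KmlExport {
  // Hypothetical stub; a real system would query a GeoIP database here.
  def lookupLatLng(ip: String): (Double, Double) = (35.68, 139.76)

  // KML expects "lon,lat,alt" order inside <coordinates>.
  def placemark(ip: String, seenAt: String): String = {
    val (lat, lon) = lookupLatLng(ip)
    s"""<Placemark>
       |  <name>$ip</name>
       |  <TimeStamp><when>$seenAt</when></TimeStamp>
       |  <Point><coordinates>$lon,$lat,0</coordinates></Point>
       |</Placemark>""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(placemark("77.221.39.201", "2011-09-25T23:57:43Z"))
}
```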
13. Two-stage MapReduce: counting and sorting
Stage 1 (Map + Reduce1) computes the frequency count for each word; Stage 2 (Map + Reduce2) sorts according to Reduce1's output.
[Figure: Input → Map → Reduce1 → Map → Reduce2 → Output]
MapReduce is an algorithm well suited to coping with big data:
map(key1, value1) -> list<key2, value2>
reduce(key2, list<value2>) -> list<value3>
Ranking (sorting) needs a second Map phase.
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
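To make the data flow of the two stages concrete, here is an in-memory Scala analogue rather than the actual Hadoop job: stage 1 tokenizes lines and counts word frequencies, and stage 2 sorts by frequency to produce the ranking described above.

```scala
// In-memory analogue of the two MapReduce stages; the real jobs run on
// Hadoop, this just makes the key-value flow concrete.
object TwoStage {
  // Stage 1 map: emit (word, 1) for every token in a log line.
  def map1(line: String): Seq[(String, Int)] =
    line.split("[,.\\s]+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Stage 1 reduce: sum the 1s per word -> (word, frequency).
  def reduce1(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, vs) => (w, vs.map(_._2).sum) }

  // Stage 2: re-key by frequency and sort descending -> the ranking.
  def stage2(counts: Map[String, Int]): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (_, n) => -n }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "h116-0-194-107.catv02.itscom.jp",
      "c-76-28-27-107.hsd1.ct.comcast.net",
      "c-68-40-239-181.hsd1.mi.comcast.net")
    val ranked = stage2(reduce1(lines.flatMap(map1)))
    ranked.foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```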
14. Map Phase
Example log lines (hostnames):
*.0.194.107,h116-0-194-107.catv02.itscom.jp
*.28.27.107,c-76-28-27-107.hsd1.ct.comcast.net
*.40.239.181,c-68-40-239-181.hsd1.mi.comcast.net
*.253.44.184,pool-96-253-44-184.prvdri.fios.verizon.net
*.27.170.168,cpc11-stok15-2-0-cust167.1-4.cable.virginmedia.com
*.22.23.81,cpc2-stkn10-0-0-cust848.11-2.cable.virginmedia.com
In the Map phase, each line is tokenized into words (*.0.194.107, hdsl1, comcast, verizon, virginmedia, ...) and each word is assigned "1", producing key-value pairs {word, 1}. Map jobs are easier to increase than Reduce jobs.
15. Reduce Phase
The Reduce job counts the frequency of each word emitted by the Map phase: for the tokens (hdsl1, comcast, hdsl1, comcast, verizon, virginmedia), it counts up the 1s for each word, yielding key-value pairs {hdsl1, 2}, {comcast, 2}, {verizon, 1}.
16. Sorting and ranking
Sorting and ranking form the second Reduce phase: the words, with their frequencies, are sorted in the shuffle. For example, in Perl:
@list1 = reverse sort { (split(/\s/,$a))[1] <=> (split(/\s/,$b))[1] } @list1;
17. Example: ranking by number of nodes in one day
RANK Country # of nodes Region Domain
1 Russia 1,488,056 Russia RU
2 United states 1,177,766 North America US
3 China 815,934 East Asia CN
4 UK 414,282 West Europe GB
5 Canada 408,592 North America CA
6 Ukraine 399,054 East Europe UA
7 France 394,005 West Europe FR
8 India 309,008 South Asia IN
9 Taiwan 296,856 East Asia TW
10 Brazil 271,417 South America BR
11 Japan 262,678 East Asia JP
12 Romania 233,536 East Europe RO
13 Bulgaria 226,885 East Europe BG
14 South Korea 217,409 East Asia KR
15 Australia 216,250 Oceania AU
16 Poland 184,087 East Europe PL
17 Sweden 183,465 North Europe SE
18 Thailand 183,008 South East Asia TH
19 Italy 177,932 West Europe IT
20 Spain 172,969 West Europe ES
18. All cities except the US
N/A 978,457
1 Moscow 285,097 (RU:1)
2 Beijing 240,419 (CN:3)
3 Seoul 180,186 (KR)
4 Taipei 161,498 (TW:9)
5 Kiev 117,392 (RU:1)
6 Saint Petersburg 94,560
7 Bucharest 79,336
8 Sofia 78,445 (BG:13)
9 Central District 65,635 (HK)
10 Bangkok 62,882 (TH:18)
11 Delhi 62,563 (IN:8)
12 Tokyo 54,531 (JP:11)
13 London 53,514 (GB:4)
14 Guangzhou 52,981 (CN:3)
15 Athens 52,656 (3680000: 1.4%)
16 Budapest 52,031
These peers were all contacted from a single point in Tokyo within 24 hours: propagation in the DHT network reaches beyond border controls.
Z. N. J. Peterson, M. Gondree, and R. Beverly. A position paper on data sovereignty: the importance of geolocating data in the cloud. 3rd USENIX Workshop on Hot Topics in Cloud Computing, June 2011.
19. Rank 3: China, 815,934 nodes, East Asia, CN
Name / # of peers / population (×10,000) / city name (Japanese):
Beijing 240,419 1,755 北京
Guangzhou 52,981 1,004 広州
Shanghai 27,399 1,921 上海
Jinan 26,281 569 済南
Chengdu 18,835 1,059 成都
Shenyang 18,566 776 瀋陽
Tianjin 18,460 1,228 天津
Hebei 17,414 - 河北
Wuhan 15,239 910 武漢
Hangzhou 12,997 796 杭州
Harbin 10,848 987 ハルビン
Changchun 10,411 751 長春
Nanning 10,318 648 南寧
Qingdao 10,257 757 青島
Tokyo 54,531 1,318 東京
Osaka 7,430 886 大阪
Yokohama 6,983 369 横浜
Beijing is the largest city, with about 240,000 peers, second only to Moscow. In China, BitTorrent seems to be popular alongside many domestic file-sharing systems (BitComet is a popular client in Asia). Tokyo and Guangzhou have almost the same number of peers, about 50,000.
20. Demo 2: (almost) real-time monitoring of peers in Japan
In this movie there are four colors, according to the number of files located at each point. In this slide, the traffic log is translated into XML (Keyhole Markup Language). The movie can be generated after a day: aggregation and translation of 24 hours of data completes in 16 hours.
Spying the World from your Laptop: Identifying and Profiling Content Providers and Big Downloaders in BitTorrent. 3rd USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET'10), 2010.
21. Conclusion: Scalable Back-end Clusters Orchestration for Real-world Systems (Large-scale Network Monitoring)
■ Hadoop and NoSQL: Scalable Back-end Clusters Orchestration in Real-world Systems
Hadoop and NoSQL are usually used together, partly because a key-value data format (such as JSON) is well suited to exchanging data between MongoDB and HDFS. These technologies are deployed in a network monitoring system and a large-scale testbed at a national research institute in Japan.
■ What is orchestration for? – Large-scale network monitoring
With the rapid growth of botnets and file-sharing networks, network traffic monitoring logs have become "big data". Today's large-scale network monitoring needs scalable clusters for traffic logging and data processing.
■ Background – Internet traffic explosion
Some statistics were shown about mobile phone traffic and the "gigabyte club".
■ Real-world systems – Large-scale DHT network crawling
To test the performance of our system, we crawled the DHT (BitTorrent) network. Our system obtained information on over 10,000,000 nodes in 24 hours. In addition, a ranking of countries by DHT network popularity is generated by our HDFS cluster.
■ Architecture overview
We use everything available for constructing high-speed, scalable clusters (hypervisor, NoSQL, HDFS, Scala, etc.).
■ MapReduce and traffic logs
For aggregating and sorting traffic logs, we programmed a two-stage MapReduce.