The document provides instructions for running Hadoop in standalone, pseudo-distributed, and fully distributed modes. It discusses downloading and installing Hadoop, configuring environment variables and files for pseudo-distributed mode, starting Hadoop daemons, and running a sample word count MapReduce job locally to test the installation.
The title "Big Data using Hadoop.pdf" suggests that the document is likely a PDF file that focuses on the utilization of Hadoop technology in the context of Big Data. Hadoop is a popular open-source framework for distributed storage and processing of large datasets. The document is expected to cover various aspects of working with big data, emphasizing the role of Hadoop in managing and analyzing vast amounts of information.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It utilizes HDFS for storage, which distributes data across nodes and replicates files for fault tolerance. HDFS uses a master/slave architecture, with a NameNode managing the file system namespace and DataNodes storing file data in blocks. The Hadoop API provides access to HDFS through interfaces like FileSystem and FSDataInputStream, allowing applications to read, write, and manipulate data in a distributed manner.
Hadoop Papyrus is an open source project that allows Hadoop jobs to be run using a Ruby DSL instead of Java. It reduces complex Hadoop procedures to just a few lines of Ruby code. The DSL describes the Map, Reduce, and Job details. Hadoop Papyrus invokes Ruby scripts using JRuby during the Map/Reduce processes running on the Java-based Hadoop framework. It also allows writing a single DSL script to define different processing for each phase like Map, Reduce, or job initialization.
Node.js is a JavaScript runtime built on Chrome's V8 engine. It allows JavaScript to run on the server-side and is used for building network applications. Some key points about Node.js include:
- It uses an event-driven, non-blocking I/O model that makes it lightweight and efficient.
- Node package manager (npm) allows installation of external packages and libraries.
- Modules are used to organize code into reusable pieces and can be local or installed via npm.
- Testing frameworks like Mocha allow writing unit tests for modules and APIs.
Create & Execute First Hadoop MapReduce Project in.pptxvishal choudhary
The document provides a 12 step guide to create and execute a first Hadoop MapReduce project in Eclipse. The steps include installing prerequisites like Hadoop, Eclipse, and Java, creating a project in Eclipse, adding required Hadoop jar files, creating Mapper, Reducer and Driver classes, compiling the code into a jar file, and executing the MapReduce job on Hadoop by running the jar file.
The document provides an overview of using the Java API to interact with HDFS. It discusses creating a FileSystem object using the Configuration, opening an InputStream to read from HDFS files, and using IOUtils for easy copying and closing of streams. Code examples are provided for listing HDFS contents, loading configurations, and reading a file from HDFS.
Nov. 4, 2011 o reilly webcast-hbase- lars georgeO'Reilly Media
HBase Coprocessors allow user code to be deployed directly on HBase clusters. Coprocessors run within each region of a table and define an interface for client calls. Examples of coprocessors include distributed query processing and regular expression search. Coprocessors are loaded via configuration or table schema and provide hooks into various HBase operations like get, put, and scan calls as well as lifecycle events.
This document contains code for a basic word count program written in Java using Apache Spark. It defines Mapper and Reducer classes to count the frequency of words in a text file. The main method sets up the job configuration and runs the job. Other sections provide links about the history of Spark and summaries of Spark surveys from 2016 and 2017 focusing on trends in machine learning, streaming, and scaling Spark applications.
Cloud functions are google’s Functions as a Service ( FaaS ) platform. As of right now it supports Node.js and Python runtimes. In this blog, we will show you how to enable Cross Origin Resource Sharing (CORS) for a Google Cloud Function using Python.
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
The presentation given by Chris Severs and myself at the Bay Area Scala Enthusiasts meetup. http://www.meetup.com/Bay-Area-Scala-Enthusiasts/events/105409962/
This document provides an overview of Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It describes Hadoop's core components like HDFS for distributed file storage and MapReduce for distributed processing. Key aspects covered include HDFS architecture, data flow and fault tolerance, as well as MapReduce programming model and architecture. Examples of Hadoop usage and a potential project plan for load balancing enhancements are also briefly mentioned.
The document discusses three methods for importing SQL scripts using Hibernate:
1. Specify the import file in hibernate.cfg.xml using the classpath location
2. Directly specify the file path to import.sql
3. Implement a listener to convert the SQL file to the proper encoding on application startup
This document discusses using Fabric for Python application deployment and configuration management. It provides an overview of Fabric basics like tasks, roles, and environments. It also describes using Fabric for common operations like code deployment, database migrations, and managing server growth. Key advantages of Fabric include its simple task-based interface and ability to control multiple servers simultaneously. The document provides an example of using Fabric for a full deployment process including pushing code, running migrations, and restarting processes.
The document contains code for a client-server chat application written in Java. It defines classes for the client, server, and thread handling communication between the server and clients. The client and server classes set up socket connections and input/output streams for sending and receiving messages. The server accepts new connections in a loop and spawns a new thread for each client. Threads read incoming messages from clients and print them, and take input from the server console to send back to clients.
This document provides information on storing and processing big data with Apache Hadoop and Cassandra. It discusses how to install and configure Cassandra and Hadoop, perform basic operations with their command line interfaces, and implement simple MapReduce jobs in Hadoop. Key points include how to deploy Cassandra and Hadoop clusters, store and retrieve data from Cassandra using Hector and CQL, and use high-level interfaces like Hive and Pig with Hadoop.
This slide shows you how to use Akka cluster in Java.
Source Code: https://github.com/jiayun/akka_samples
If you want to use the links in slide, please download the pdf file.
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)Adin Ermie
In this new presentation, we will cover advanced Terraform topics (full-on DevOps). We will compare the deployment of Terraform using Azure DevOps, GitHub/GitHub Actions, and Terraform Cloud. We wrap everything up with some key takeaway learning resources in your Terraform learning adventure.
NOTE: A recording of this presenting is available here: https://www.youtube.com/watch?v=fJ8_ZbOIdto&t=5574s
This document discusses the Puppet configuration management tool. It provides an overview of Puppet including its open source nature, supported platforms, file structure, and types of resources it can manage like files, packages, services. It also discusses Facter for collecting system facts. Several examples are shown of how to configure files, packages, services. Finally Amazon EC2 is mentioned as a way to deploy Puppet in a scalable environment.
2. Hadoop Platforms
• Platforms: Unix and Windows.
– Linux: the only supported production platform.
– Other variants of Unix, like Mac OS X: can run Hadoop for development.
– Windows + Cygwin: a development platform only (needs openssh).
• Java 6
– Java 1.6.x (a.k.a. 6.0.x, a.k.a. 6) is recommended for running Hadoop.
3. Hadoop Installation
• Download a stable version of Hadoop:
– http://hadoop.apache.org/core/releases.html
• Untar the Hadoop file:
– tar xvfz hadoop-0.20.2.tar.gz
• Set JAVA_HOME in hadoop/conf/hadoop-env.sh:
– Mac OS: /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home (/Library/Java/Home)
– Linux: which java
• Environment variables:
– export PATH=$PATH:$HADOOP_HOME/bin
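Note that the PATH export assumes HADOOP_HOME is already defined; the deck never sets it, so something like export HADOOP_HOME=/path/to/hadoop-0.20.2 (the path here is only an example) has to come first.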
4. Hadoop Modes
• Standalone (or local) mode
– There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
• Fully distributed mode
– The Hadoop daemons run on a cluster of machines.
5. Pseudo-Distributed Mode
• Create an RSA key to be used by Hadoop when ssh'ing to localhost:
– ssh-keygen -t rsa -P ""
– cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
– ssh localhost
• Configuration files
– core-site.xml
– mapred-site.xml
– hdfs-site.xml
– masters/slaves files: localhost
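The slide names the configuration files but not their contents. A minimal sketch of the classic Hadoop 0.20-era pseudo-distributed settings (the localhost:9000 and localhost:9001 addresses are the conventional examples, not values taken from this deck):

<!-- core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can hold only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: the JobTracker also runs locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

With these in place, the NameNode is formatted once with bin/hadoop namenode -format and the daemons are started with bin/start-all.sh.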
12.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutMerge {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.out.println("Usage PutMerge <dir> <outfile>");
      System.exit(1);
    }

    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    FileSystem local = FileSystem.getLocal(conf);
    int filesProcessed = 0;

    Path inputDir = new Path(args[0]);
    Path hdfsFile = new Path(args[1]);

    try {
      FileStatus[] inputFiles = local.listStatus(inputDir);
      FSDataOutputStream out = hdfs.create(hdfsFile);
      for (int i = 0; i < inputFiles.length; i++) {
        if (!inputFiles[i].isDir()) {
          System.out.println("\tnow processing <" +
              inputFiles[i].getPath().getName() + ">");
          FSDataInputStream in = local.open(inputFiles[i].getPath());
          byte buffer[] = new byte[256];
          int bytesRead = 0;
          while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
          }
          filesProcessed++;
          in.close();
        }
      }
      out.close();
      System.out.println("\nSuccessfully merged " + filesProcessed +
          " local files and written to <" + hdfsFile.getName() + "> in HDFS.");
    } catch (IOException ioe) {
      ioe.printStackTrace();
    }
  }
}
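PutMerge is a client-side utility rather than a MapReduce job: it walks a local directory with the local FileSystem object and streams each regular file into one newly created HDFS file, so a directory full of small files arrives in HDFS pre-merged.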
13.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
14. JobClient.runJob(conf)
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem, which is used for sharing job files between the other entities.
16. Job Launch: Client
• Client program creates a JobConf
– Identify classes implementing Mapper and Reducer interfaces
• setMapperClass(), setReducerClass()
– Specify inputs, outputs
• setInputPath(), setOutputPath()
– Optionally, other options too:
• setNumReduceTasks(), setOutputFormat()…
17. Job Launch: JobClient
• Pass JobConf to
– JobClient.runJob() // blocks
– JobClient.submitJob() // does not block
• JobClient:
– Determines proper division of input into InputSplits
– Sends job data to master JobTracker server
18. Job Launch: JobTracker
• JobTracker:
– Inserts jar and JobConf (serialized to XML) in shared location
– Posts a JobInProgress to its run queue
19. Job Launch: TaskTracker
• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
– main() is provided by Hadoop
20. Job Launch: Task
• TaskTracker.Child.main():
– Sets up the child TaskInProgress attempt
– Reads XML configuration
– Connects back to necessary MapReduce components via RPC
– Uses TaskRunner to launch user process
21. Job Launch: TaskRunner
• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch Mapper
– Task knows ahead of time which InputSplits it should be mapping
– Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
24.
public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
25.
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
26. Creating the Mapper
• Your instance of Mapper should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
– Exists in separate process from all other instances of Mapper – no data sharing!
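(Note that "extend MapReduceBase" describes the old org.apache.hadoop.mapred API. The TokenizerMapper on slide 28 uses the newer org.apache.hadoop.mapreduce API, where a mapper extends Mapper directly and MapReduceBase is not involved.)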
28.
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
29. What is Writable?
• Hadoop defines its own "box" classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
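A minimal sketch of the box classes in isolation (BoxDemo is a made-up name):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxDemo {
  public static void main(String[] args) {
    Text word = new Text("hadoop");          // box for a string
    IntWritable count = new IntWritable(42); // box for an int
    System.out.println(word + " -> " + count.get()); // get() unboxes
  }
}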
31.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable
    implements WritableComparable<MyWritableComparable> {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public int compareTo(MyWritableComparable w) {
    // order instances by their counter field
    int thisValue = this.counter;
    int thatValue = w.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }
}
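One contract worth stating explicitly: readFields() must consume the fields in exactly the order write() produced them, because Hadoop streams raw bytes with no per-field markers.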
32. Getting Data To The Mapper
(diagram) Input files → InputSplits (the InputFormat carves each input file into one or more splits) → one RecordReader per split → one Mapper per split → intermediate (k, v) outputs.
33.
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
34. Reading Data
• Data sets are specified by InputFormats
– Defines input data (e.g., a directory)
– Identifies partitions of the data that form an InputSplit
– Factory for RecordReader objects to extract (k, v) records from the input source
35. FileInputFormat and Friends
• TextInputFormat
– Treats each '\n'-terminated line of a file as a value
• KeyValueTextInputFormat
– Maps '\n'-terminated text lines of "k SEP v"
• SequenceFileInputFormat
– Binary file of (k, v) pairs (for passing data from the output of one MapReduce job to the input of another MapReduce job)
• SequenceFileAsTextInputFormat
– Same, but maps (k.toString(), v.toString())
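With the old mapred API these formats are selected on the JobConf; a minimal sketch (FormatSetup is a made-up driver name, and TextInputFormat is what you get when nothing is set):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class FormatSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf(FormatSetup.class);
    // Parse each input line as "key SEP value" instead of offset/line.
    conf.setInputFormat(KeyValueTextInputFormat.class);
  }
}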
36. Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
– e.g., create your own "xyzFileInputFormat" to read *.xyz from the directory list
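A related way to get the same effect without subclassing is the PathFilter hook that FileInputFormat exposes; a minimal sketch (XyzPathFilter is a made-up name):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// A PathFilter accepts or rejects candidate input paths by name.
public class XyzPathFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz"); // keep only *.xyz inputs
  }
}

It is registered in the driver with FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class).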
37. Record Readers
• Each InputFormat provides its own RecordReader implementation
– Provides (unused?) capability multiplexing
• LineRecordReader
– Reads a line from a text file
• KeyValueRecordReader
– Used by KeyValueTextInputFormat
38. Input Split Size
• FileInputFormat will divide large files into chunks
– Exact size controlled by mapred.min.split.size
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size
– e.g., "NeverChunkFile"
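A minimal sketch of both knobs against the old mapred API (class names and the 128 MB figure are illustrative, not from the deck):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SplitTuning.class);
    // Coalesce small blocks into fewer, larger map tasks (example: 128 MB).
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
  }
}

// A "NeverChunkFile"-style format refuses to split, so each file
// becomes exactly one InputSplit:
class NeverChunkFileInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}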
39.
public class ObjectPositionInputFormat extends FileInputFormat<Text, Point3D> {

  public RecordReader<Text, Point3D> getRecordReader(
      InputSplit input, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(input.toString());
    return new ObjPosRecordReader(job, (FileSplit) input);
  }

  // also available to override:
  // InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
}
40.
class ObjPosRecordReader implements RecordReader<Text, Point3D> {

  public ObjPosRecordReader(JobConf job, FileSplit split) throws IOException {
  }

  public boolean next(Text key, Point3D value) throws IOException {
    // get the next line; return true while records remain in the split
  }

  public Text createKey() {
  }

  public Point3D createValue() {
  }

  public long getPos() throws IOException {
  }

  public void close() throws IOException {
  }

  public float getProgress() throws IOException {
  }
}
41. Sending Data To Reducers
• Map function receives OutputCollector object
– OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
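The word-count mapper on slide 28 uses the new-API Context instead; a minimal sketch of the same logic against the old mapred API described here (LineMapper is a made-up name):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one); // any (WritableComparable, Writable) pair
    }
  }
}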
43. Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
– incrCounter(Enum key, long amount)
– setStatus(String msg)
• Allows self-identification of input
– InputSplit getInputSplit()
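Counter keys are ordinary Java enums defined by the job; a call such as reporter.incrCounter(MyCounters.BAD_RECORDS, 1) (MyCounters being a hypothetical enum) is aggregated across all tasks and appears in the job's final counter report.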
45. Partitioner
• int getPartition(key, val, numPartitions)
– Outputs the partition number for a given key
– One partition == values sent to one Reduce task
• HashPartitioner used by default
– Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
46.
public class MyPartitioner implements Partitioner<IntWritable, Text> {

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    /* Pretty ugly hard-coded partitioning function. Don't do that in
       practice; it is just for the sake of understanding. */
    int nbOccurences = key.get();
    if (nbOccurences < 3)
      return 0;
    else
      return 1;
  }

  @Override
  public void configure(JobConf arg0) {
  }
}

conf.setPartitionerClass(MyPartitioner.class);
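Since this partitioner only ever returns 0 or 1, the job should run with exactly two reduce tasks (conf.setNumReduceTasks(2)): fewer would make partition 1 invalid, and more would leave reducers idle.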
47. Reduction
• reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
• Keys & values sent to one partition all go to the same reduce task
• Calls are sorted by key – "earlier" keys are reduced and output before "later" keys
48.
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
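Because integer addition is associative and commutative, this same class can safely double as a combiner, which is exactly what the WordCount driver on slide 25 does with job.setCombinerClass(IntSumReducer.class).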