This document provides an overview of the Apache Hadoop API for input formats. It discusses the responsibilities of input formats, common input formats like TextInputFormat and KeyValueTextInputFormat, and binary formats like SequenceFileInputFormat. It also covers the InputFormat and RecordReader classes, using mappers to process input splits, and considerations for keys and values.
3. InputFormat Responsibilities
Divide input data into logical input splits
Data in HDFS is divided into blocks, but processed as input splits
An InputSplit may contain any number of blocks (usually 1)
Each Mapper processes one input split
Creates RecordReaders to extract <key, value> pairs
2/24/13
4. InputFormat Class
public abstract class InputFormat<K, V> {

  public abstract List<InputSplit> getSplits(JobContext context)
      throws ...;

  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws ...;
}
5. Most Common InputFormats
TextInputFormat
Each '\n'-terminated line is a value
The byte offset of that line is a key
Why not a line number?
KeyValueTextInputFormat
Key and value are separated by a separator (tab by default)
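The "Why not a line number?" question above has a simple answer: a mapper that starts reading in the middle of a file knows its byte offset from the split metadata, but it cannot know how many lines precede it without scanning everything before the split. A minimal Hadoop-free sketch of the <offset, line> pairs that TextInputFormat-style reading produces (the helper name `lineOffsets` is made up, and one byte per character is assumed for simplicity):

```java
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {
    // Returns "offset\tline" pairs, i.e. the keys and values that
    // TextInputFormat-style reading would produce for the given text.
    static List<String> lineOffsets(String text) {
        List<String> pairs = new ArrayList<>();
        int offset = 0;
        for (String line : text.split("\n", -1)) {
            // Skip only the artificial empty segment after a trailing '\n'.
            if (!line.isEmpty() || offset < text.length()) {
                pairs.add(offset + "\t" + line);
            }
            offset += line.length() + 1; // +1 for the '\n' terminator
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String p : lineOffsets("first\nsecond\nthird\n")) {
            System.out.println(p);
        }
    }
}
```

Each key is available locally: it is just a running byte count within the split, no coordination with other mappers required.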
6. Binary InputFormats
SequenceFileInputFormat
SequenceFiles are flat files consisting of binary <key, value> pairs
AvroInputFormat
Avro supports rich data structures (not necessarily <key, value> pairs) serialized to files or messages
Compact, fast, language-independent, self-describing, dynamic
7. Some Other InputFormats
NLineInputFormat
The input should not be too big, since splits are calculated in a single thread (NLineInputFormat#getSplitsForFile)
CombineFileInputFormat
An abstract class, but not so difficult to extend
SeparatorInputFormat
A how-to here: http://blog.rguha.net/?p=293
8. Some Other InputFormats
MultipleInputs
Supports multiple input paths with a different
InputFormat and Mapper for each path
MultipleInputs.addInputPath(job,
firstPath, FirstInputFormat.class, FirstMapper.class);
MultipleInputs.addInputPath(job,
secondPath, SecondInputFormat.class, SecondMapper.class);
10. InputFormat Interesting Facts
Ideally InputSplit size is equal to HDFS block size
Or an InputSplit contains multiple collocated HDFS blocks
InputFormat may prevent splitting a file
A whole file is then processed by a single mapper (e.g. gzip)
boolean FileInputFormat#isSplitable();
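The gzip case above can be handled by subclassing a FileInputFormat; a sketch assuming the new-API isSplitable signature (the old API's isSplitable(FileSystem, Path) is analogous; the class name is illustrative):

```java
// Sketch: a FileInputFormat subclass that refuses to split its files,
// so a whole file (e.g. a gzip archive) goes to a single mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
```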
11. InputFormat Interesting Facts
Mapper knows the file/offset/size of the split that it processes
MapContext#getInputSplit()
Useful for later debugging on a local machine
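A sketch of what that lookup can look like inside a mapper, assuming a FileInputFormat-based job so the InputSplit is a FileSplit:

```java
// Sketch: inspecting the current split inside a Mapper's setup() or map().
FileSplit split = (FileSplit) context.getInputSplit();
Path file = split.getPath();      // which file this mapper reads
long start = split.getStart();    // byte offset of the split in that file
long length = split.getLength();  // split size in bytes
```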
12. InputFormat Interesting Facts
A PathFilter (applied by the InputFormat) specifies which files
are included in or excluded from the input data
PathFilter hiddenFileFilter = new PathFilter() {
  public boolean accept(Path p) {
    String name = p.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
};
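The accept() logic can be exercised in plain Java without a cluster; the helper below mirrors the filter body (with Hadoop on the classpath, such a filter would typically be registered via FileInputFormat.setInputPathFilter; the demo class name is illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HiddenFileFilterDemo {
    // Same rule as the PathFilter above: skip hidden files and
    // MapReduce bookkeeping files such as _SUCCESS and _logs.
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("part-r-00000", "_SUCCESS", ".hidden", "data.txt");
        List<String> visible = files.stream()
                .filter(HiddenFileFilterDemo::accept)
                .collect(Collectors.toList());
        System.out.println(visible); // [part-r-00000, data.txt]
    }
}
```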
14. RecordReader Logic
Must handle the common situation where InputSplit and
HDFS block boundaries do not match
Image source: Hadoop: The Definitive Guide by Tom White
15. RecordReader Logic
Exemplary solution, based on LineRecordReader
Skips* everything from the start of its split until the first '\n'
Reads past the end of its split into the next block until it sees '\n'
*except the very first split (an offset equal to 0)
Image source: Hadoop: The Definitive Guide by Tom White
16. Keys And Values
Keys must implement the WritableComparable interface
Since they are sorted before being passed to the Reducers
Values must implement "at least" the Writable interface
18. Writable And WritableComparable
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
public interface WritableComparable<T> extends Writable,
Comparable<T> {
}
public interface Comparable<T> {
public int compareTo(T o);
}
19. Example: SongWritable
class SongWritable implements Writable {
  String title;
  int year;
  byte[] content;
  …
  public void write(DataOutput out) throws ... {
    out.writeUTF(title);
    out.writeInt(year);
    out.writeInt(content.length);
    out.write(content);
  }
}
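A sketch of the matching readFields(), which must read fields in exactly the order write() emitted them. Since DataInput and DataOutput are plain JDK interfaces, the serialization logic can be round-tripped without Hadoop on the classpath (the Song class and sample values below are illustrative, not the deck's SongWritable):

```java
import java.io.*;

class Song {
    String title;
    int year;
    byte[] content;

    // Same field order as the slide's write()
    void write(DataOutput out) throws IOException {
        out.writeUTF(title);
        out.writeInt(year);
        out.writeInt(content.length);
        out.write(content);
    }

    // Reads fields back in exactly the order write() emitted them
    void readFields(DataInput in) throws IOException {
        title = in.readUTF();
        year = in.readInt();
        content = new byte[in.readInt()];
        in.readFully(content);
    }
}

public class SongRoundTrip {
    public static void main(String[] args) throws IOException {
        Song s = new Song();
        s.title = "Disturbia"; s.year = 2008; s.content = new byte[]{1, 2, 3};

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        s.write(new DataOutputStream(bos));

        Song copy = new Song();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(copy.title + " " + copy.year); // Disturbia 2008
    }
}
```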
20. Mapper
Takes input in the form of a <key, value> pair
Emits a set of intermediate <key, value> pairs
Stores them locally and later passes them to the Reducers
But before that: partition + sort + spill + merge
22. MapContext Object
Allows the user's map code to communicate with the MapReduce system
public InputSplit getInputSplit();
public TaskAttemptID getTaskAttemptID();
public void setStatus(String msg);
public boolean nextKeyValue() throws ...;
public KEYIN getCurrentKey() throws ...;
public VALUEIN getCurrentValue() throws ...;
public void write(KEYOUT key, VALUEOUT value) throws ...;
public Counter getCounter(String groupName, String counterName);
23. Examples Of Mappers
Implement highly specialized Mappers and reuse/chain them
when possible
IdentityMapper
InverseMapper
RegexMapper
TokenCounterMapper
24. TokenCounterMapper
public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
25. General Advice
Reuse Writable instances instead of creating a new one each time
Apache Commons' StringUtils class seems to be the most
efficient choice for String tokenization
26. Chain Of Mappers
Use multiple Mapper classes within a single Map task
The output of the first Mapper becomes the input of the
second, and so on until the last Mapper
The output of the last Mapper will be written to the task's
output
Encourages implementation of reusable and highly
specialized Mappers
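A configuration sketch in the style of the ChainReducer example later in the deck (old mapred API; AMap, BMap and mapAConf are hypothetical names):

```java
// Sketch: two mappers run back to back inside one map task.
// AMap's <LongWritable, Text> output becomes BMap's input.
ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, mapAConf);
ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, null);
```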
28. Partitioner
Specifies which Reducer a given <key, value> pair is sent to
Aim for an even distribution of the intermediate data
Skewed data may overload a single reducer and make the whole
job run longer
public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
29. HashPartitioner
The default choice for general-purpose use cases
public int getPartition(K key, V value, int numReduceTasks) {
  // the mask clears the sign bit, keeping the result non-negative
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
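The masking deserves a note: hashCode() may be negative, and & Integer.MAX_VALUE clears the sign bit so the modulo result stays in [0, numReduceTasks). A plain-Java sketch (no Hadoop needed; the demo class name is illustrative):

```java
public class HashPartitionDemo {
    // Same arithmetic as HashPartitioner above
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // "polygenelubricants" famously hashes to Integer.MIN_VALUE
        String key = "polygenelubricants";
        System.out.println(key.hashCode());        // negative
        System.out.println(getPartition(key, 10)); // 0 — mask turned MIN_VALUE into 0
        System.out.println(getPartition("Disturbia", 10)); // always in [0, 10)
    }
}
```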
32. TotalOrderPartitioner
Three samplers
InputSampler.RandomSampler<K,V>
Sample from random points in the input
InputSampler.IntervalSampler<K,V>
Sample from s splits at regular intervals
InputSampler.SplitSampler<K,V>
Samples the first n records from s splits
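A wiring sketch for one of the samplers (new-API class names; the partition file path and sampler parameters are illustrative, not prescriptive):

```java
// Sketch: total ordering with a random sampler.
job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
    new Path("/tmp/partitions.lst"));
// sample ~10% of records, at most 10000 samples from at most 10 splits
InputSampler.writePartitionFile(job,
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));
```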
34. Reducer Run Method
public void run(Context context) throws … {
  setup(context);
  while (context.nextKey()) {
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}
35. Chain Of Mappers After A Reducer
The ChainReducer class allows chaining multiple Mapper classes after a
Reducer within the Reducer task
Combined with ChainMapper, one could get [MAP+ / REDUCE MAP*]
ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
Text.class, Text.class, true, reduceConf);
ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
LongWritable.class, Text.class, false, null);
ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
LongWritable.class, LongWritable.class, true, null);
39. Job Class Methods
public void setInputFormatClass(..);
public void setOutputFormatClass(..);
public void setMapperClass(..);
public void setCombinerClass(..);
public void setReducerClass(..);
public void setPartitionerClass(..);
public void setMapOutputKeyClass(..);
public void setMapOutputValueClass(..);
public void setOutputKeyClass(..);
public void setOutputValueClass(..);
public void setSortComparatorClass(..);
public void setGroupingComparatorClass(..);
public void setNumReduceTasks(int tasks);
public void setJobName(String name);
public float mapProgress();
public float reduceProgress();
public boolean isComplete();
public boolean isSuccessful();
public void killJob();
public void submit();
public boolean waitForCompletion(..);
40. ToolRunner
Supports parsing of the generic options, allowing the user to specify
configuration options on the command line
hadoop jar examples.jar SongCount
-D mapreduce.job.reduces=10
-D artist.gender=FEMALE
-files dictionary.dat
-libjars math.jar,spotify.jar
songs counts
41. Side Data Distribution
public class MyMapper<K, V> extends Mapper<K, V, V, K> {
  String gender = null;
  File dictionary = null;

  protected void setup(Context context) throws … {
    Configuration conf = context.getConfiguration();
    gender = conf.get("artist.gender", "MALE");
    dictionary = new File("dictionary.dat"); // shipped via -files
  }
}
42. public class WordCount extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: %s [options] <input> <output>%n", getClass().getSimpleName());
      return -1;
    }
    Job job = new Job(getConf());
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] allArgs) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WordCount(), allArgs);
    System.exit(exitCode);
  }
}
43. MRUnit
Built on top of JUnit
Provides mock InputSplit, Context and other classes
Can test
The Mapper class,
The Reducer class,
The full MapReduce job
The pipeline of MapReduce jobs
44. MRUnit Example
public class IdentityMapTest {
  private MapDriver<Text, Text, Text, Text> driver;

  @Before
  public void setUp() {
    driver = new MapDriver<Text, Text, Text, Text>(new MyMapper<Text, Text, Text, Text>());
  }

  @Test
  public void testMyMapper() {
    driver
      .withInput(new Text("foo"), new Text("bar"))
      .withOutput(new Text("oof"), new Text("rab"))
      .runTest();
  }
}
45. Example: Secondary Sort
The reduce(key, Iterator<value>) method gets an iterator
over values
These values are not sorted for a given key
Sometimes we want to get them sorted
Useful to find the minimum or maximum value quickly
46. Secondary Sort Is Tricky
A couple of custom classes are needed
WritableComparable
Partitioner
SortComparator (optional, but recommended)
GroupingComparator
48. Custom Partitioner
HashPartitioner uses a hash on keys
The same titles may go to different reducers (because the title is
combined with a timestamp in the composite key)
Use a custom partitioner that partitions only on the first part of the key
int getPartition(TitleWithTs key, LongWritable value, int num) {
  return (key.title.hashCode() & Integer.MAX_VALUE) % num;
}
49. Ordering Of Keys
Keys need to be ordered before being passed to the reducer
Order by natural key and, for the same natural key, by the
value portion of the key
Implement sorting in WritableComparable or use
Comparator class
job.setSortComparatorClass(SongWithTsComparator.class);
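The ordering can be sketched in plain Java on composite keys encoded as title#ts strings (illustrative only; a real SongWithTsComparator would extend WritableComparator but implement the same ordering):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CompositeKeySortDemo {
    // Natural key (title) ascending, then timestamp ascending
    static final Comparator<String> SORT = (a, b) -> {
        String[] ka = a.split("#"), kb = b.split("#");
        int cmp = ka[0].compareTo(kb[0]);
        if (cmp != 0) return cmp;
        return Long.compare(Long.parseLong(ka[1]), Long.parseLong(kb[1]));
    };

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Arrays.asList(
                "Disturbia#7", "Fast car#2", "Disturbia#1", "SOS#4", "Disturbia#4"));
        keys.sort(SORT);
        System.out.println(keys);
        // [Disturbia#1, Disturbia#4, Disturbia#7, Fast car#2, SOS#4]
    }
}
```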
50. Data Passed To The Reducer
By default, each unique key triggers its own reduce() invocation
(Disturbia#1, 1) → reduce method is invoked
(Disturbia#4, 4) → reduce method is invoked
(Disturbia#7, 7) → reduce method is invoked
(Fast car#2, 2) → reduce method is invoked
(Fast car#2, 2)
(Fast car#6, 6) → reduce method is invoked
(SOS#4, 4) → reduce method is invoked
51. Data Passed To The Reducer
The grouping comparator (set via setGroupingComparatorClass) determines
which keys and values are passed in a single call to the reduce method
Just look at the natural key when grouping
(Disturbia#1, 1) → reduce method is invoked
(Disturbia#4, 4)
(Disturbia#7, 7)
(Fast car#2, 2) → reduce method is invoked
(Fast car#2, 2)
(Fast car#6, 6)
(SOS#4, 4) → reduce method is invoked
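The grouping rule can likewise be sketched in plain Java on title#ts string keys (illustrative; in a job this logic would live in a WritableComparator registered via job.setGroupingComparatorClass):

```java
public class GroupingDemo {
    // Compare only the natural key (title), ignoring the timestamp,
    // so all values for one title reach a single reduce() call.
    static int compareGroup(String a, String b) {
        return a.split("#")[0].compareTo(b.split("#")[0]);
    }

    public static void main(String[] args) {
        System.out.println(compareGroup("Disturbia#1", "Disturbia#7") == 0); // true: same group
        System.out.println(compareGroup("Disturbia#1", "Fast car#2") == 0);  // false
    }
}
```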
53. Question – A Possible Answer
Implement TotalSort, but
Each Reducer produces an additional file containing a pair
<minimum_value, number_of_values>
After the job ends, a single-threaded application
Reads these files to build an index
Calculates which value in which file is the median
Finds this value in that file