Scalding is a Scala library built on top of Cascading that simplifies the process of defining MapReduce programs. It uses a functional programming approach where data flows are represented as chained transformations on TypedPipes, similar to operations on Scala iterators. This avoids some limitations of the traditional Hadoop MapReduce model by allowing for more flexible multi-step jobs and features like joins. The Scalding TypeSafe API also provides compile-time type safety compared to Cascading's runtime type checking.
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some otherwise challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th, 2014.
2. What is Scalding
Scalding is a Scala library written on top of Cascading that makes it easy to define MapReduce programs
3. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
7. Map and Reduce
At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs.
Map: a function from an input key/value pair to a list of intermediate key/value pairs
map : (key_input, value_input) → list(key_map, value_map)
Reduce: a function from an intermediate key and its list of values to a list of output key/value pairs
reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
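The two signatures can be sketched on an in-memory collection, with the grouping step standing in for the framework's shuffle. This is a plain-Scala illustration of the model, not Hadoop code; the names mapFn and reduceFn are illustrative.

```scala
// A single-machine sketch of the two signatures above, using word count
// as the running example (plain Scala, no Hadoop involved).

// map: (key_input, value_input) -> list(key_map, value_map)
def mapFn(offset: Int, line: String): List[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toList

// reduce: (key_map, list(value_map)) -> list(key_reduce, value_reduce)
def reduceFn(word: String, counts: List[Int]): List[(String, Int)] =
  List((word, counts.sum))

val lines = List("a b a", "b c")

// the framework applies mapFn, groups the intermediate pairs by key
// (the shuffle), then applies reduceFn to every group
val intermediate = lines.zipWithIndex.flatMap { case (line, i) => mapFn(i, line) }
val grouped = intermediate.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
val output = grouped.toList.flatMap { case (k, vs) => reduceFn(k, vs) }
// output counts each word: ("a", 2), ("b", 2), ("c", 1)
```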
10. Hadoop Programming Model
The Hadoop MapReduce programming model allows control over all the job workflow components. Job components are divided into two phases:
The Map Phase (diagram): Data Source → Reader → Mapper → Combiner → Partitioner → Sorter, where the combiner merges map-side values for the same key, e.g. combine(Vm1, Vm5) = Vm6
The Reduce Phase (diagram): Shuffle → Sorter → Grouper → Reducer → Writer → Data Destination
11. Example: Word Count 1/2

class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    context.write(key, new IntWritable(sum));
  }
}
13. Example: Word Count 2/2

public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Sending the integer 1 for each instance of a word is very inefficient (1 TB of input yields more than 1 TB of intermediate data)
Hadoop cannot know whether the reducer is safe to use as a combiner; it must be set manually
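What the combiner buys can be simulated on plain Scala collections: pre-aggregating each map task's output before it crosses the network shrinks the shuffled data without changing the result. The two splits below are hypothetical mapper inputs, and this is an illustration of the idea, not the Hadoop API.

```scala
// Plain-Scala simulation of map-side combining for word count.

val split1 = List("to", "be", "or", "not", "to", "be")
val split2 = List("to", "see", "or", "not")

// without a combiner, every occurrence ships its own (word, 1) pair
val naive = (split1 ++ split2).map(w => (w, 1))

// with a combiner, each split ships one pair per distinct word
def combine(split: List[String]): List[(String, Int)] =
  split.groupBy(identity).map { case (w, ws) => (w, ws.size) }.toList

val combined = combine(split1) ++ combine(split2)

// the reduce side sums per-word counts and gets the same answer either way
def reduceAll(pairs: List[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

// here naive ships 10 intermediate records while combined ships 8;
// on real text, with heavily repeated words, the gap is what keeps
// "1 TB in" from becoming more than 1 TB shuffled
```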
14. Hadoop weaknesses
The reducer cannot always be used as a combiner; Hadoop relies on an explicit combiner specification or on manual partial aggregation inside the mapper instance life cycle (the in-mapper combiner pattern)
Combiners are limited to associative and commutative functions (like sum). Partial aggregation is more general and powerful
The programming model is limited to the map/reduce phases; multi-job programs are often difficult and counter-intuitive (think about iterative algorithms like PageRank)
Joins can be difficult, and many techniques must be implemented from scratch
More generally, MapReduce is indeed simple, but many optimizations feel more like hacks than natural solutions
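Averaging is the classic case where partial aggregation is more general than reusing the reducer as a combiner: a reducer that emits a mean cannot be re-applied to partial means, but merging (sum, count) pairs is associative and commutative, so it can run on the map side, the reduce side, or both. A plain-Scala sketch with made-up example values:

```scala
// Why "reuse the reducer as a combiner" fails for averaging, while
// partial aggregation of (sum, count) pairs works everywhere.

type Partial = (Double, Int) // (running sum, running count)

def merge(a: Partial, b: Partial): Partial = (a._1 + b._1, a._2 + b._2)

val readings = List(1.0, 2.0, 3.0)

// two "map tasks" partially aggregate their own chunks, then merge
val chunkA = readings.take(2).map(v => (v, 1)).reduce(merge)
val chunkB = readings.drop(2).map(v => (v, 1)).reduce(merge)
val (sum, count) = merge(chunkA, chunkB)
val mean = sum / count // correct: 2.0

// naively combining the two chunks' means is wrong:
val wrongMean = ((1.0 + 2.0) / 2 + 3.0) / 2 // 2.25, not 2.0
```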
16. Cascading
Open source project developed @Concurrent
It is a Java application framework on top of Hadoop, designed to be extensible by providing:
Processing API: to develop complex data flows
Integration API: integration testing supported by the framework, to avoid putting unstable software into production
Scheduling API: used to schedule units of work from any third-party application
It changes the MapReduce programming model into a more generic, data-flow-oriented programming model
Cascading has a data flow optimizer that converts user data flows into optimized data flows
17. Cascading Programming Model
A Cascading program is composed of flows
A flow is composed of a source tap, a sink tap, and pipes that connect them
A pipe holds a particular transformation over its input data flow
Pipes can be combined to create more complex programs
18. Example: Word Count
MapReduce word count concept (diagram): TextLine data source → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) → TextLine data destination
Cascading word count concept (diagram): TextLine → tokenize each line → group by tokens → count values in every group → TextLine
19. Example: Word Count

public class WordCount {
  public static void main( String[] args ) {
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

    RegexSplitGenerator s = new RegexSplitGenerator(
        new Fields( "token" ),
        "[ \\[\\]\\(\\),.]" );
    Pipe docPipe = new Each( "token", new Fields( "text" ), s,
        Fields.RESULTS ); // text -> token

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps and pipes to create a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

    new HadoopFlowConnector().connect( flowDef ).complete();
  }
}
24. Scalding
Open source project developed @Twitter
Two APIs:
Field Based
Primary API: stable
Uses Cascading Fields: dynamic with errors at runtime
Type Safe
Secondary API: experimental
Uses Scala Types: static with errors at compile time
The two APIs can work together using pipe.typed and
TypedPipe.from
This presentation is about the TypeSafe API ¨
16/21
26. Why Scalding
The MapReduce high-level idea comes from LISP and works on functions
(Map/Reduce) and function composition
Cascading works on objects representing functions and uses constructors
to compose pipes:

Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

Functional programming can naturally describe data flows: every pipe can
be seen as a function, and pipes can be combined using functional
composition. The code above can be written as:

docPipe.groupBy( new Fields( "token" ) )
  .every( Fields.ALL, new Count(), Fields.ALL )
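The same composition style runs locally on an ordinary Scala collection. This is a plain-Scala analogy of the pipeline above, not the Cascading or Scalding API itself (`ComposeSketch` and `wordCount` are illustrative names):

```scala
// Word count as chained transformations on a Scala List,
// mirroring the tokenize -> groupBy -> count pipeline.
object ComposeSketch {
  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))             // tokenize each line
      .groupBy(identity)                    // group by token
      .map { case (w, ws) => (w, ws.size) } // count values in every group

  def main(args: Array[String]): Unit =
    println(wordCount(List("a b", "a")))
}
```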
29. Example: Word Count
class WordCount(args : Args) extends Job(args) {

  /* TextLine reads each line of the given file */
  val input = TypedPipe.from( TextLine( args( "input" ) ) )

  /* tokenize every line and flatten the result into a list of words */
  val words = input.flatMap{ tokenize(_) }

  /* group by word; size adds the count of each group */
  val wordGroups = words.groupBy{ identity(_) }.size

  /* write each (word, count) pair as a line of the output */
  wordGroups.write( TypedTsv[(String, Long)]( args( "output" ) ) )

  /* Split a piece of text into individual words */
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}
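The tokenize helper is plain Scala and can be checked locally without Hadoop. The snippet below repeats its logic with the regex escapes written out (`TokenizeCheck` is an illustrative wrapper, not part of the job):

```scala
// Local check of the tokenize logic: lowercase, strip punctuation,
// split on whitespace.
object TokenizeCheck {
  def tokenize(text: String): Array[String] =
    text.trim.toLowerCase
      .replaceAll("[^a-zA-Z0-9\\s]", "") // remove punctuation
      .split("\\s+")                     // split on whitespace runs

  def main(args: Array[String]): Unit =
    println(tokenize("Hello, Hadoop World!").toList) // List(hello, hadoop, world)
}
```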
31. Scalding TypeSafe API
Two main concepts:
TypedPipe[T]: class whose instances are distributed objects that wrap a
Cascading Pipe object and hold the transformations done up to that point.
Its interface is similar to Scala’s Iterator[T] (map, flatMap, groupBy,
filter, ...)
KeyedList[K,V]: trait that represents a sharded list of items. Two
implementations:
Grouped[K,V]: represents a grouping on keys of type K
CoGrouped2[K,V,W,Result]: represents a cogroup over two grouped pipes.
Used for joins
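The two KeyedList shapes can be pictured with plain Scala collections. This is a local analogy, not the Scalding classes themselves: the `grouped` and `cogroup` helpers below are illustrative names.

```scala
// grouped mimics Grouped[K, V]: values sharded by key.
// cogroup mimics the CoGrouped2 idea: pairing two grouped
// collections on their shared keys, as in an inner join.
object KeyedSketch {
  def grouped[K, V](xs: List[(K, V)]): Map[K, List[V]] =
    xs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  def cogroup[K, V, W](a: Map[K, List[V]],
                       b: Map[K, List[W]]): Map[K, (List[V], List[W])] =
    (a.keySet intersect b.keySet).map(k => k -> (a(k), b(k))).toMap

  def main(args: Array[String]): Unit = {
    val users  = grouped(List(1 -> "ann", 2 -> "bob"))
    val visits = grouped(List(1 -> "home", 1 -> "about"))
    println(cogroup(users, visits)) // only key 1 appears in both
  }
}
```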
33. Conclusions
The MapReduce API is powerful but limited
The Cascading API is as simple as the MapReduce API but more generic and
powerful
Scalding combines Cascading and Scala to easily describe distributed
programs. Its major strengths are:
Functional programming naturally describes data flows
Scalding is similar to the Scala collections library: if you know Scala
then you already know how to use Scalding
Statically typed (TypeSafe API): no type errors at runtime
Scala is standard and runs on top of the JVM
Scala libraries and tools can be used in production: IDEs, debuggers,
test frameworks, build systems and everything else