This is the Apache Spark session with examples.
It gives a brief idea about Apache Spark. Apache Spark is a fast and general engine for large-scale data processing.
By the end of this presentation you should be fairly clear about Apache Spark.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-spark
Learning Objectives - In this module, you will learn what is Pig, in which type of use case we can use Pig, how Pig is tightly coupled with MapReduce, and Pig Latin scripting.
Paul Tarjan ( http://github.com/ptarjan ) presented this to the Hadoop User Group at the Yahoo! Sunnyvale campus on 11/18/09. Paul describes his solution for building a Hadoop Record Reader in Python.
Data Intelligence 2017 - Building a Gigaword CorpusRebecca Bilbro
As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
So you want to get started with Hadoop, but how. This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
This is the Apache Spark session with examples.
It gives a brief idea about Apache Spark. Apache Spark is a fast and general engine for large-scale data processing.
By the end of this presentation you should be fairly clear about Apache Spark.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-spark
Learning Objectives - In this module, you will learn what is Pig, in which type of use case we can use Pig, how Pig is tightly coupled with MapReduce, and Pig Latin scripting.
Paul Tarjan ( http://github.com/ptarjan ) presented this to the Hadoop User Group at the Yahoo! Sunnyvale campus on 11/18/09. Paul describes his solution for building a Hadoop Record Reader in Python.
Data Intelligence 2017 - Building a Gigaword CorpusRebecca Bilbro
As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
So you want to get started with Hadoop, but how. This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkRde:code 2017
Azure's HDInsight provides an easy way to process big data using Spark, and learn from it using Machine Learning. See SparkML in action, and learn how to use R and Python at scale, within Jupyter.
製品/テクノロジ: AI (人工知能)/Deep Learning (深層学習)/Machine Learning (機械学習)/Microsoft Azure
Michael Lanzetta
Microsoft Corporation
Developer Experience and Evangelism
Principal Software Development Engineer
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter
Big data analysis using Tajo on AWS (Hands-on session)
- presented by Young-kyong Ko, data analyst at Gruter
- at Gruter TECHDAY 2014 (Oct. 29 Seoul, Korea)
After the construction of several datalakes and large business intelligence pipelines, we now know that the use of Scala and its principles were essential to the success of those large undertakings.
In this talk, we will go through the 7 key scala-based architectures and methodologies that were used in real-life projects. More specifically, we will see the impact of these recipes on Spark performances, and how it enabled the rapid growth of those projects.
Java Persistence Frameworks for MongoDBTobias Trelle
After a short introduction to the MongoDB Java driver we'll have a detailed look at higher level persistence frameworks like Morphia, Spring Data MongoDB and Hibernate OGM with lots of examples.
Scalding: Twitter's Scala DSL for Hadoop/Cascadingjohnynek
Talk given at the 2012 Hadoop Summit in San Jose, CA.
Scalding is a Scala DSL for Cascading which brings natural functional programming to Hadoop. It is open-source, developed by Twitter and others.
Follow: twitter.com/scalding
github.com/twitter/scalding
This is an quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids on the other hand promise parallelism and quality and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkRde:code 2017
Azure's HDInsight provides an easy way to process big data using Spark, and learn from it using Machine Learning. See SparkML in action, and learn how to use R and Python at scale, within Jupyter.
製品/テクノロジ: AI (人工知能)/Deep Learning (深層学習)/Machine Learning (機械学習)/Microsoft Azure
Michael Lanzetta
Microsoft Corporation
Developer Experience and Evangelism
Principal Software Development Engineer
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter
Big data analysis using Tajo on AWS (Hands-on session)
- presented by Young-kyong Ko, data analyst at Gruter
- at Gruter TECHDAY 2014 (Oct. 29 Seoul, Korea)
After the construction of several datalakes and large business intelligence pipelines, we now know that the use of Scala and its principles were essential to the success of those large undertakings.
In this talk, we will go through the 7 key scala-based architectures and methodologies that were used in real-life projects. More specifically, we will see the impact of these recipes on Spark performances, and how it enabled the rapid growth of those projects.
Java Persistence Frameworks for MongoDBTobias Trelle
After a short introduction to the MongoDB Java driver we'll have a detailed look at higher level persistence frameworks like Morphia, Spring Data MongoDB and Hibernate OGM with lots of examples.
Scalding: Twitter's Scala DSL for Hadoop/Cascadingjohnynek
Talk given at the 2012 Hadoop Summit in San Jose, CA.
Scalding is a Scala DSL for Cascading which brings natural functional programming to Hadoop. It is open-source, developed by Twitter and others.
Follow: twitter.com/scalding
github.com/twitter/scalding
This is an quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids on the other hand promise parallelism and quality and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.
The critical thing to remember about Spark and Hadoop is they are not mutually exclusive or inclusive but they work well together and makes the combination strong enough for lots of big data applications.
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
Sharing of Hadoop cluster deployment experience in production from scratch on real hardware. Brief overview of Hadoop stack, its components, major deployment and configuration challenges, performance tuning and application tuning experience. Some “war stories” about the issues we have faced while operating, the benefits of DevOps approach for running Hadoop apps.
Presentation about a few situations at Apigee where we wanted/needed interoperability between Node.js and Java. One example is Hadoop-based MapReduce. During this presentation I discuss some use cases, some of the problems and then finish talking about technology that allows you to write MapReduce jobs using Node.js and run them natively on Hadoop as if written in Java.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
How world-class product teams are winning in the AI era by CEO and Founder, P...
Design of a_dsl_by_ruby_for_heavy_computations
1. Design of a DSL by Ruby
for heavy computations
over map-reduce clusters
the 37th Grace seminar
16th June, 2010
Koichi Fujikawa
Cirius Technologies, Inc.
4. We Live in the "Big Data" era
World-wide web page data (Text-only) is expected
400TB (at one point).
Some web service company (like Google,
Yahoo, etc) have to process these data for
their business, but..
General HDD can read data in 50MB/sec. This
means we can take 2000 hours (approx. 100
days) to read the total web data(400TB) by one
machine.
We need the parallel processing / file system.
5. MapReduce
MapReduce is one of the parallel skeletons
Became popular by Google's paper(2004)
MapReduce has two phases
Map phase: transform key and value to
another (key and) value
Reduce phase: aggregate and calculate
values by one key
Each record process by map phase first and
then by reduce phase
6.
7. Hadoop
Hadoop is open source clone of Google
MapReduce hosted by Apache Foundation
Big web service provider(Yahoo, Facebook,
etc) contribute this project actively.
Large development and user community all
over the world (including Japan)
Hadoop conference Japan 2009
Hadoop source code reading events
9. Programming Model
General programmers, engineers are not
familiar with this "MapReduce" model, so it is
too difficult to try and use
Especially to separate Map and Reduce
No Effective way of the "pattern of the
MapRecuce programming" because this
technology is not mature for the engineers.
We have to find this individually. It is very
difficult and time-consuming.
10. Programming Language
Hadoop is written in Java language, so the
programmers need to write Map and Reduce
procedure in Java.
Java is strong typed and compile language.
Some web service engineer don't like these
language.
No problem if the code is fixed and
completed, but I wonder it is suitable for ad-
hoc prototyping and easy querying.
MapReduce jobs depend on what users want to
get, so flexibility is important, I think.
12. Hide complexity of MapReduce
I found the description for MapReduce could
be simpler in some specific case (e.g. log
analysis).
In this case (but almost all of Hadoop usage is
now log analysis), it would be nice if
programmers can write the description without
taking care of MapReduce!
13. DSL approach by Ruby
For this description, I created DSL for each
specific usage.
Log analysis DSL is a reference
implementation which I prepared.
As DSL runtime environment for Hadoop, I
chose Ruby and JRuby, which is Ruby
runtime working on JVM.
Ruby is very flexible and reusable object-
oriented language, so very easy to create
DSL processor.
15. Hadoop Papyrus
DSL framework for Hadoop by JRuby
We can write log analysis code by
only several line.
Open source (Apache Licence) same as
Hadoop
Hosted by github
Distributed by common Ruby archive site
RubyGems.org
Supported by IPA mitoh 2009
20. On the way to big challenge
We need parallel processing method to
handle massive web-scale data.
MapReduce and Hadoop is one of good tools,
but..
Difficult to describe Map and Reduce
Irritated to write Java for someone :-)
Hadoop Papyrus is providing the key!
Ruby-based DSL framework for Hadoop
You can write Map and Reduce at once