The document discusses optimizing MapReduce pipelines in Apache Crunch. It describes how Crunch can orchestrate and optimize MapReduce jobs by chaining together map, group, and reduce functions. It provides an example of a pipeline that reads data from a file, counts visitors using parallel mapping, groups the results by key, and writes the final counts to an output file. The document notes that network traffic between nodes is often a bottleneck, and discusses how preprocessing data during mapping can reduce network traffic and improve performance compared to a standard MapReduce job.
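As a rough sketch of the pipeline the summary describes (assuming the NaiveCountVisitors DoFn shown later in the deck; the driver class name and the "access.log" / "visitor-counts" paths are illustrative, not from the original slides):

    // Uses org.apache.crunch.*, org.apache.crunch.impl.mr.MRPipeline,
    // org.apache.crunch.types.writable.Writables, org.apache.crunch.fn.Aggregators
    Pipeline pipeline = new MRPipeline(VisitorCount.class, getConf()); // getConf(): driver extends Configured, as in the slides
    PCollection<String> lines = pipeline.readTextFile("access.log");

    // parallelDo: map each log line to a (host, 1) pair
    PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
        new NaiveCountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));

    // group by host, sum the counts, and write the result out
    PTable<String, Integer> counts =
        visitors.groupByKey().combineValues(Aggregators.SUM_INTS());
    pipeline.writeTextFile(counts, "visitor-counts");
    pipeline.done();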
4. HOW TO TELL IF YOU REALLY HAVE BIG DATA:
All of your data does not fit on a single machine
(photo by Fernando Stankuns)
5. HOW TO TELL IF YOU REALLY HAVE BIG DATA:
You talk about terabytes rather than gigabytes
6. HOW TO TELL IF YOU REALLY HAVE BIG DATA:
The amount of data you process grows constantly, and it should double next year.
(photo by Saulo Cruz)
13. Apache Crunch
A library for building MapReduce pipelines on top of Hadoop
Chains together and orchestrates different MapReduce functions
As a bonus, it optimizes and simplifies the MapReduce implementation
Based on FlumeJava: Easy, Efficient Data-Parallel Pipelines (Google, 2010)
14. Crunch – Anatomy of a Pipeline
[Slide diagram: a Data Source on HDFS is read and flows through DoFn 1 and DoFn 2 across Hadoop Node 1 and Hadoop Node 2 as a PCollection*/PTable*, then a Write sends the result to a Data Target on HDFS. *PCollection, PTable, or PGroupedTable]
15. Crunch – Anatomy of a Pipeline
[Same diagram, annotating that the DoFns are attached to the pipeline through parallelDo(). *PCollection, PTable, or PGroupedTable]
16. Crunch – Anatomy of a Pipeline
[Same diagram, now showing DoFn 1 replicated on both Hadoop nodes: the parallelDo() stage runs as parallel copies of the same function. *PCollection, PTable, or PGroupedTable]
17. Crunch – Anatomy of a Pipeline
[Same diagram, final build: each node runs its copies of the DoFns and writes its share of the output to the Data Target on HDFS. *PCollection, PTable, or PGroupedTable]
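Slides 14–17 build up the point that consecutive DoFns run inside the same map phase. A minimal sketch of what that chaining looks like in code, using two hypothetical DoFns (ParseLine and ExtractHost) that are not part of the original deck:

    PCollection<String> lines = pipeline.readTextFile("access.log");
    PCollection<String> hosts = lines
        .parallelDo("parse", new ParseLine(), Writables.strings())    // DoFn 1
        .parallelDo("host", new ExtractHost(), Writables.strings());  // DoFn 2
    // Crunch's planner fuses both DoFns into a single map task per input split;
    // data only touches HDFS or the network at a write or a groupByKey boundary.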
28.
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.crunch.*;

// The map phase as a Crunch DoFn: emit a (host, 1) pair for every log line.
public class NaiveCountVisitors extends DoFn<String, Pair<String, Integer>> {
  @Override
  public void process(String line, Emitter<Pair<String, Integer>> emitter) {
    String[] parts = line.split(" ");
    try {
      URL url = new URL(parts[2]);
      emitter.emit(Pair.of(url.getHost(), 1));
    } catch (MalformedURLException e) {
      // skip malformed records instead of failing the task
    }
  }
}
29.
// The same map phase written against the raw Hadoop MapReduce API
// (assumes the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports).
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text page = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split(" ");
    try {
      page.set(new URL(parts[2]).getHost());
    } catch (MalformedURLException e) {
      return; // skip malformed records
    }
    context.write(page, one);
  }
}
30.
// The reduce phase in raw MapReduce: sum the 1s emitted for each host.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable counter = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable value : values) {
      count += value.get();
    }
    counter.set(count);
    context.write(key, counter);
  }
}

// The same group-and-sum step expressed in Crunch:
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.SUM_INTS());
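The combineValues call above is also what allows Crunch to pre-aggregate on the map side, which is where the network savings discussed in the editor's notes come from. In the hand-written MapReduce version the equivalent trick is registering a combiner; a hedged sketch of the job wiring (the job setup is not shown in the original slides):

    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class); // partial sums computed on each mapper before the shuffle
    job.setReducerClass(Reduce.class);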
51. *It is big data, remember?
Hadoop/HDFS does not work well with small files.
52. What if the source is not a text file?
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
DataBaseSource<Vistors> dbsrc = new DataBaseSource.Builder<Vistors>(Vistors.class)
    .setDriverClass(org.h2.Driver.class)
    .setUrl("jdbc://…").setUsername("root").setPassword("")
    .selectSQLQuery("SELECT URL, UID FROM TEST")
    .countSQLQuery("select count(*) from Test").build();
PCollection<Vistors> visitors = pipeline.read(dbsrc);
PTable<String, Integer> counts = lines.parallelDo("Count Visitors",
    new NaiveCountVisitors(),
    Writables.tableOf(Writables.strings(), Writables.ints()));
…
PipelineResult pipelineResult = pipeline.done();
53. Or build your own data source…
public class RedisSource<T extends Writable> implements Source<T> {
  @Override
  public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId)
      throws IOException {
    Configuration configuration = job.getConfiguration();
    // RedisConfiguration, RedisInputFormat and the fields used below (redisMasters,
    // dbNumber, dataStructure, …) belong to the custom source and are omitted on the slide.
    RedisConfiguration.configureDB(configuration,
        redisMasters, dbNumber, dataStructure);
    job.setInputFormatClass(RedisInputFormat.class);
    RedisInputFormat.setInput(job, inputClass, redisMasters,
        dbNumber, sliceExpression, maxRecordsToReturn,
        dataStructure, null, null);
  }
}
Other pieces to implement:
- Read
- getSplits
- Write
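Once such a source exists it plugs into a pipeline like any built-in one. A hedged usage sketch; the VisitorEvent type and the constructor arguments are illustrative, since the full RedisSource is not shown on the slides:

    PCollection<VisitorEvent> events =
        pipeline.read(new RedisSource<VisitorEvent>(redisMasters, dbNumber, dataStructure));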
54. Make sure your Big Data really is Big
Optimize your code: it will be executed MANY times
Design your pipeline to maximize parallelism
55. In the cloud you have virtually unlimited resources.
But the cost is unlimited too.
56. Big Data otimizado: Arquiteturas eficientes para construção de Pipelines MapReduce
Fabiane Bizinella Nardon
@fabianenardon
Editor's Notes
The bottleneck usually is caused by the amount of data going across the network