This document discusses big data processing frameworks and presents Port, a MapReduce framework for Pharo. It describes how Port allows executing MapReduce jobs locally or on Hadoop clusters and provides a Spark-like API. The document demonstrates Port by implementing a word counting application and executing it on sample poll data to count votes by region in a distributed manner. Finally, it discusses potential uses of Port and ideas for live debugging of distributed big data applications.
2. Matteo Marra - Big Data Processing in Pharo
Big Data Frameworks
Programming Model: to express large data processing jobs in terms of simple functions (e.g., Map/Reduce).
Parallel Execution Model: to execute computation in parallel on large amounts of data.
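The two-function programming model can be illustrated with a minimal, self-contained Python sketch (a sequential toy, not Port's actual API — `run_map_reduce` and its parameter names are hypothetical):

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Minimal sequential model of a Map/Reduce job:
    map each record to (key, value) pairs, group values by key,
    then reduce each group to a single value."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}

# Classic word count expressed with just the two functions.
lines = ["big data in pharo", "big data frameworks"]
counts = run_map_reduce(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=sum,
)
print(counts["big"])  # 2
```

A real framework runs the map calls and the per-key reductions on different machines; the user-visible contract is just these two functions.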
3. Why Big Data? Google MapReduce
Index the WWW: more than 20 billion web pages (> 400 TB).
Do it fast: with 1,000 machines, < 3 hours; with one machine, 4 months.
Keep it cheap: use a lot of recycled hardware, but with robust error recovery.
Reference: Dean J. and Ghemawat S., "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04.
4. Map/Reduce by Example
Input dataset, one vote per line (surname, region, Unix timestamp):
Rossi Abruzzo 1522706400
Verdi Toscana 1527976800
Bianchi Toscana 1525298400
Verdi Lombardia 1527976800
Gialli Toscana 1525298400
Rossi Abruzzo 1525298400
Gialli Veneto 1527976800
Bianchi Sardegna 1522706400
Gialli Toscana 1527976800
Bianchi Lombardia 1527976800
Bianchi Toscana 1522706400
Gialli Sicilia 1522706400
Gialli Sicilia 1527976800
Bianchi Sardegna 1527976800
KeyBy: each line becomes a pair keyed by surname, e.g. (Rossi, Abruzzo …), (Verdi, Toscana …).
Map: pairs outside Toscana are mapped to NIL; only (Verdi, Toscana …), two (Bianchi, Toscana …) and two (Gialli, Toscana …) survive.
GroupBy: the surviving pairs are grouped by surname.
Reduce: each group is reduced to a vote count — (Verdi, 1), (Bianchi, 2), (Gialli, 2).
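The four stages above can be walked through in a short Python sketch (a single-process illustration over the same sample data, not Port code):

```python
from collections import defaultdict

votes = """\
Rossi Abruzzo 1522706400
Verdi Toscana 1527976800
Bianchi Toscana 1525298400
Verdi Lombardia 1527976800
Gialli Toscana 1525298400
Rossi Abruzzo 1525298400
Gialli Veneto 1527976800
Bianchi Sardegna 1522706400
Gialli Toscana 1527976800
Bianchi Lombardia 1527976800
Bianchi Toscana 1522706400
Gialli Sicilia 1522706400
Gialli Sicilia 1527976800
Bianchi Sardegna 1527976800
""".splitlines()

# KeyBy: (surname, [region, timestamp]) pairs.
keyed = [(line.split()[0], line.split()[1:]) for line in votes]

# Map: keep only Toscana votes; everything else becomes None (NIL).
mapped = [(k, v) if v[0] == "Toscana" else None for k, v in keyed]

# GroupBy: group the surviving pairs by surname.
groups = defaultdict(list)
for pair in mapped:
    if pair is not None:
        groups[pair[0]].append(pair[1])

# Reduce: count the votes in each group.
counts = {name: len(vs) for name, vs in groups.items()}
print(counts)  # {'Verdi': 1, 'Bianchi': 2, 'Gialli': 2}
```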
6. Port: a Map/Reduce Framework for Pharo
Architecture diagram labels: one node coordinates the job; the worker nodes apply Map and Reduce.
7. Deploying Port
Local: deploy on one machine using multiple cores.
Standalone: deploy on multiple machines in the same network.
YARN Mode: deploy on a cluster using Hadoop YARN.
8. Using Port
app := VoteCountingApp new.
port startMapReduceApplication: app on: data.
9. Handling Results
VoteCountingApp>>handleResult: resultPair
	self storeToHDFS: resultPair
Asynchronous execution: the application is executed asynchronously.
Asynchronous result handling: the result is handled asynchronously by the Map/Reduce master.
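This asynchronous hand-off can be sketched generically with Python's `concurrent.futures` (an illustration of the pattern only, not Port's implementation; `handle_result` plays the role of `handleResult:`):

```python
from concurrent.futures import ThreadPoolExecutor

results = []

def handle_result(future):
    # Invoked when the job finishes, like the master
    # calling handleResult: on the application.
    results.append(future.result())

def map_reduce_job(data):
    return sum(data)  # stand-in for a real Map/Reduce computation

with ThreadPoolExecutor() as executor:
    future = executor.submit(map_reduce_job, [1, 2, 3])
    future.add_done_callback(handle_result)
# Leaving the with-block waits for outstanding jobs; the caller
# itself never blocked on the result explicitly.
print(results)  # [6]
```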
10. From Map/Reduce to Spark
Distributed Collection: instead of executing Map/Reduce on a dataset, you load the dataset into memory.
Broader API: where the Map/Reduce model offers only map: and reduce:, the distributed collection exposes a broad set of methods — map:, reduce:, filter:, aggregate:, groupBy:.
In-memory operations: avoid heavy storing of intermediate results by explicitly keeping them in memory.
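The shape of such a chainable collection API can be sketched in Python (a toy single-process analogue, not Port's implementation; the method names simply mirror the selectors above):

```python
from collections import defaultdict

class Collection:
    """Toy in-memory analogue of a distributed collection:
    every operation returns a new Collection, so calls chain."""
    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        return Collection(fn(x) for x in self.items)

    def filter(self, pred):
        return Collection(x for x in self.items if pred(x))

    def group_by(self, key_fn):
        groups = defaultdict(list)
        for x in self.items:
            groups[key_fn(x)].append(x)
        return Collection(groups.items())

    def reduce_by_key(self, fn):
        # Items are (key, value) pairs; fold the values per key.
        groups = defaultdict(list)
        for k, v in self.items:
            groups[k].append(v)
        out = {}
        for k, vs in groups.items():
            acc = vs[0]
            for v in vs[1:]:
                acc = fn(acc, v)
            out[k] = acc
        return Collection(out.items())

data = Collection(["a 1", "b 2", "a 3"])
result = (data
          .map(lambda line: line.split())
          .map(lambda p: (p[0], int(p[1])))
          .reduce_by_key(lambda a, b: a + b))
print(dict(result.items))  # {'a': 4, 'b': 2}
```

In a real distributed collection each intermediate Collection would stay partitioned across workers instead of materialising on one machine.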
11. The Spark-like API in Port
Map/Reduce API:
port startMapReduceApplication: app on: dataSet.
VoteCountingApp>>map: aLine
	…
VoteCountingApp>>handleResult: res
	self storeToHDFS: res
VoteCountingApp>>run
Spark-like API:
data := port distribute: dataSet.
result := data map: [ … ].
result storeToHDFSPath: '…'.
12. The Polls Analyser in Port (V2)
port := Port withServerURL: 'http://myServerURL'.
data := port distribute: dataSet.
data := data filter: [:line | line includesSubstring: 'Toscana'].
data := data map: [:line | | splitted |
	splitted := line substrings: ' '.
	(DateAndTime fromUnixTime: (splitted at: 3)) > DateAndTime yesterday
		ifTrue: [ (splitted at: 2) -> 1 ]].
result := data reduceByKey: [:value :sum | sum + value].
result getCollection.
result storeToHDFSPath: '…'
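For readers who don't write Pharo, the same filter / map / reduceByKey pipeline can be sketched in plain Python (single-process; a fixed cutoff timestamp stands in for DateAndTime yesterday so the result is reproducible; all names are illustrative, not Port API):

```python
from collections import defaultdict
from datetime import datetime, timezone

lines = [
    "Rossi Abruzzo 1522706400",
    "Verdi Toscana 1527976800",
    "Bianchi Toscana 1525298400",
    "Gialli Toscana 1527976800",
    "Bianchi Toscana 1522706400",
]

# Fixed cutoff standing in for "yesterday" (2018-05-01 UTC).
cutoff = datetime(2018, 5, 1, tzinfo=timezone.utc)

# filter: keep only the Toscana lines.
toscana = [l for l in lines if "Toscana" in l]

# map: emit (region, 1) for votes newer than the cutoff.
pairs = []
for l in toscana:
    name, region, ts = l.split()
    if datetime.fromtimestamp(int(ts), tz=timezone.utc) > cutoff:
        pairs.append((region, 1))

# reduceByKey: sum the 1s per key.
totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value
print(dict(totals))  # {'Toscana': 3}
```

Of the four Toscana lines, the one with the April timestamp (1522706400) falls before the cutoff and is dropped, leaving three recent votes.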
14. Uses of Port
Blockchain analysis: in collaboration with Santiago at RMoD, Lille. Full blockchain in 6 hours vs. 2 days!
Genetic algorithms: in collaboration with Alexandre Bergel at Universidad de Chile.
15. Towards Live Big Data Development Tools
Online debugging: debug the system as the bug is happening.
Global view: centralised debugging of the distributed system.
Update of the running system: deploy code fixes without restarting the whole system.
Isolation: debug the system without interfering with its execution.