Functional Big Data (by Vance Shipley)


This presentation was done by Vance Shipley (CTO at Wavenet) at the SLASSCOM Tech Talk - 'Smart Data Engineering' on 26th November 2014.

Agenda
MapReduce
Google
Scaling Out
Key Value Store
Chaining
Fault Tolerance
Functional Example
Business Problem
Design
Processes
Schema
Big Data Guidelines

Google MapReduce
+ Paper published in 2004
+ Implemented in 2003
+ Production use at Google
+ Built for Google
+ Not open sourced

Google in 2004
+ Clusters of 100s or 1000s of servers
o Linux
o dual-processor x86
o 2-4 GB memory
o 100BaseT or GigE
o inexpensive IDE hard drives
+ Servers fail every day
+ Network maintenance is constant

Scaling Out
+ Scaling up (faster computer) doesn’t get far
+ Scaling out is the only next step
+ Hundreds/thousands of modest computers
outperform the biggest single computers
+ Scaling one to a few is hard
+ Scaling a few to many is easy
+ Scaling many to massive is (almost) trivial

Intermediate Data
+ Input data is split between the workers
+ Map workers create key/value pairs
+ Reduce workers read in all intermediate
data and sort by key
+ Reduce workers then iterate over the sorted
data producing a result for each key
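
The four steps above can be sketched in a single process. This is only an illustration, not Google's implementation: the `map_reduce` helper, the simulated worker count, and the `count_links` example (counting inbound links per page) are all assumptions made for the sketch.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn, workers=3):
    """Single-process illustration of the map/sort/reduce flow."""
    # Split the input between the (simulated) map workers.
    splits = [records[i::workers] for i in range(workers)]
    # Each map worker emits (key, value) pairs as intermediate data.
    intermediate = [pair
                    for split in splits
                    for record in split
                    for pair in map_fn(record)]
    # The reduce side reads all intermediate data and sorts it by key.
    intermediate.sort(key=itemgetter(0))
    # Iterate over the sorted data, producing one result per key.
    return {key: reduce_fn([value for _, value in group])
            for key, group in groupby(intermediate, key=itemgetter(0))}


def count_links(page):
    # Hypothetical map function: one (target, 1) pair per outgoing link.
    return [(target, 1) for target in page["links"]]


pages = [
    {"url": "a", "links": ["b", "c"]},
    {"url": "b", "links": ["c"]},
    {"url": "c", "links": ["a", "c"]},
]
inbound = map_reduce(pages, count_links, sum)
print(inbound)
```

In a real deployment the splits, the intermediate pairs, and the per-key reductions would live on different servers; here they are plain lists so the data flow is easy to follow.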

Rinse and Repeat
+ Often the results of one MapReduce are
used as input to another
+ Building on a powerful basic functional
model, complex data processing can be
accomplished
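
As a rough sketch of chaining, the output of one run can be fed straight in as the input of the next. The `map_reduce` helper and the word-count data below are hypothetical and exist only to show the shape of the idea.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Minimal in-memory MapReduce: group mapped values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(values)) for key, values in sorted(groups.items())]


lines = ["to be or not to be", "to do"]

# Stage 1: count how often each word occurs.
counts = map_reduce(lines, lambda line: [(word, 1) for word in line.split()], sum)

# Stage 2: the input is stage 1's output; group the words by their count.
by_count = map_reduce(counts, lambda pair: [(pair[1], pair[0])], sorted)
print(by_count)
```

Each stage stays a pure function of its input, which is what makes the stages easy to compose.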

Fault Tolerance
+ Likelihood of failure rises with number of
servers and processing time
+ Resiliency is a necessity at scale
+ Scheduler/Supervisor (master) reassigns
failed jobs and ensures reduce workers find
the (right) data
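
To see why the likelihood of failure rises with the number of servers, here is a small calculation. It assumes failures are independent and that each server has the same probability of failing during the job; the 0.1% figure is an illustrative assumption, not a number from the talk.

```python
def p_any_failure(p_single: float, servers: int) -> float:
    """Probability that at least one of `servers` machines fails,
    assuming independent failures with probability `p_single` each."""
    return 1 - (1 - p_single) ** servers


# Hypothetical per-server failure probability for the duration of one job.
p = 0.001
for n in (1, 100, 1000):
    print(n, round(p_any_failure(p, n), 3))
```

Under these assumptions a failure that is rare on one machine becomes more likely than not on a thousand, which is why the master has to plan for reassignment.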

Example Business Problem
Scenario:
A mobile operator wants to know if an instant
messaging (IM) service would be useful to
current subscribers.
Question:
What percentage of text messages (SMS)
are part of a conversation?

Challenge
✓ 10 million subscribers
✓ average of 100 SMS a month per subscriber
✓ ∴ one billion SMS each month
✓ call detail records (CDR) include SMS but also
voice and data events
✓ ∴ 20 billion (20,000,000,000) records/month
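
The arithmetic behind these figures, worked through as a quick check. The SMS share and the per-second rate are derived here rather than stated on the slide, and the 30-day month is an assumption.

```python
subscribers = 10_000_000
sms_per_subscriber = 100

sms_per_month = subscribers * sms_per_subscriber    # one billion
records_per_month = 20_000_000_000                  # all CDR events

# Share of CDR records that are SMS events and so survive the filter.
sms_share = sms_per_month / records_per_month

# Average input rate, assuming a 30-day month.
seconds_per_month = 30 * 24 * 60 * 60
records_per_second = records_per_month / seconds_per_month

print(sms_per_month, sms_share, round(records_per_second))
```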

Requirements
+ Identify SMS conversations
o messages sent or received with one other party
o interval between messages < 10 minutes
o at least three messages exchanged
+ Provide result as
o ratio of conversational to non-conversational SMS
o per subscriber
o per month
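
One possible reading of the conversation rule, as a sketch: given the sorted timestamps of all messages exchanged with a single other party, a conversation is a run of at least three messages in which each gap is under 10 minutes. The function name and the use of seconds are assumptions made for the example.

```python
def conversational_count(timestamps, max_gap=600, min_len=3):
    """Count messages that belong to a conversation.

    `timestamps` are sorted times in seconds for the messages exchanged
    with one other party. A conversation is a run of at least `min_len`
    messages with every gap shorter than `max_gap` seconds.
    """
    total = 0
    run = 0
    previous = None
    for t in timestamps:
        if previous is not None and t - previous < max_gap:
            run += 1                 # message continues the current run
        else:
            if run >= min_len:
                total += run         # close a run that was long enough
            run = 1                  # start a new run with this message
        previous = t
    if run >= min_len:
        total += run                 # close the final run
    return total


print(conversational_count([0, 60, 120, 2000, 2100, 5000]))
```

This batch version counts every message of a qualifying run, including the first two.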

Filter
+ Read events from CDR files
o records are in chronological order
o read files in chronological order
+ Discard non-SMS events
+ Distribute SMS events to Map processes
o Consistent distribution by subscriber
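
A minimal sketch of the filter step. The record layout (dictionaries with `time`, `type`, `subscriber`, and `other` fields) is hypothetical; real CDR formats vary by vendor.

```python
def filter_sms(cdr_records):
    """Yield only SMS events, preserving the chronological input order."""
    for record in cdr_records:
        if record["type"] == "sms":
            yield record


# Hypothetical CDR events, already in chronological order.
cdrs = [
    {"time": 0, "type": "voice", "subscriber": "94770001111", "other": "94770002222"},
    {"time": 5, "type": "sms", "subscriber": "94770001111", "other": "94770002222"},
    {"time": 9, "type": "data", "subscriber": "94770003333", "other": None},
    {"time": 12, "type": "sms", "subscriber": "94770002222", "other": "94770001111"},
]
sms = list(filter_sms(cdrs))
print([record["time"] for record in sms])
```

Using a generator means records are discarded as they are read, so the non-SMS events never need to be held in memory.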

Hashing
+ To analyze interval between
messages one process must
handle all events for a
particular subscriber
+ Simple Hash:
o M = last four digits of subscriber’s
mobile number
o N = number of processes available
o Pid = M rem N
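
The simple hash can be written directly. The slide's `rem` is the integer remainder; for the non-negative values used here, Python's `%` gives the same result. The function name and sample number are made up for the example.

```python
def worker_index(mobile_number: str, n_workers: int) -> int:
    """Pick the Map process for a subscriber: Pid = M rem N."""
    m = int(mobile_number[-4:])   # M: last four digits of the mobile number
    return m % n_workers          # N: number of processes available


# The same subscriber always lands on the same process.
print(worker_index("94771234567", 8))
```

Because the index depends only on the subscriber's number, every event for that subscriber reaches the same process, which is what the interval analysis needs.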

Map
+ Read subscriber’s stored data
+ Find other party in set
+ Increment total count of messages
+ Is previous message < 10 minutes ago?
o Is next previous message < 10m before previous?
- Increment conversational messages count
+ Update previous and next previous times
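
The map step above might look roughly like this. It is a sketch under assumptions: the store is a plain dictionary, the set of other parties is modelled as a dictionary keyed by the other party's number, and times are in seconds.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW = 600  # 10 minutes, in seconds


@dataclass
class OtherParty:
    total: int = 0
    conversational: int = 0
    previous: Optional[int] = None        # time of the last message
    next_previous: Optional[int] = None   # time of the message before that


def map_event(store, subscriber, other, timestamp):
    """Apply one SMS event to the subscriber's stored data."""
    parties = store.setdefault(subscriber, {})          # read stored data
    party = parties.setdefault(other, OtherParty())     # find other party
    party.total += 1                                    # total message count
    if (party.previous is not None
            and timestamp - party.previous < WINDOW
            and party.next_previous is not None
            and party.previous - party.next_previous < WINDOW):
        party.conversational += 1
    # Update previous and next previous times.
    party.next_previous = party.previous
    party.previous = timestamp


store = {}
for t in (0, 60, 120, 180):
    map_event(store, "94770001111", "94770002222", t)
result = store["94770001111"]["94770002222"]
print(result.total, result.conversational)
```

Following the slide literally, only the third and later messages of a run are counted as conversational, so four quick messages yield a count of two.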

Interim Data
+ We are using an in-memory key/value store
+ The key is the subscriber number
+ The value is a set of OtherParty
+ OtherParty data structure contains counts
+ When the map is complete we transfer the
data to disk for persistence

Reduce
+ Collect intermediate data
from disk copies
+ Iterate through all parties for
each subscriber
+ Total all party counts
+ Provide result as percentage
of conversational messages
to total messages
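
A sketch of the reduce step for one subscriber. The shape of the intermediate data (a dictionary of per-party counts) is an assumption made for the example.

```python
def reduce_subscriber(parties):
    """Total all party counts and return the percentage of
    conversational messages to total messages."""
    total = sum(counts["total"] for counts in parties.values())
    conversational = sum(counts["conversational"] for counts in parties.values())
    if total == 0:
        return 0.0
    return 100.0 * conversational / total


# Hypothetical intermediate data for one subscriber.
parties = {
    "94770002222": {"total": 4, "conversational": 2},
    "94770003333": {"total": 6, "conversational": 0},
}
percentage = reduce_subscriber(parties)
print(percentage)
```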

Big Data Guidelines
+ Find opportunities for concurrency
+ Choose the right containers for your data
+ Use memory as effectively as possible
+ Minimize copying data
+ Avoid any unnecessary overhead
+ Anything you are going to do hundreds of
billions of times should be efficient!

SLASSCOM TECH TALKS
https://www.facebook.com/SlasscomTechnologyForum
http://www.slasscom.lk/events
https://twitter.com/slasscom
www.slideshare.net/slasscomtechforum

Editor's Notes

Successfully handling really big data requires massive concurrency, and in the real world this requires fault tolerance.
Google didn’t invent map and reduce but they were the first to apply the paradigm in a general way on a massive scale.
… or, more probably, a number of results. By dividing the work we can assign it to many servers. This concurrency is what allows scale.
Here is an example of something which Google do as part of their core business. Google places web sites which are linked to by many other web sites higher in search results (PageRank). To determine this a map reads web pages found by crawlers and creates key/value pairs. These are written in memory and then pushed out in blocks to disk. A reduce reads these disk blocks and sorts all the intermediate data by key. The reduce function then iterates over all the pairs for a key and outputs one result for each key.
The results from one MapReduce can be, and often are, provided as input for further MapReduce runs.
Something like RAID, maybe Redundant Array of Inexpensive Servers (RAIS)? They can and do fail individually without the system failing.
The user process forks all of the other processes which will be used including a master process. The master then assigns those processes work to perform, either map or reduce roles.
The master process monitors each worker by sending a ping periodically. When it detects that a server has failed (or is no longer reachable) it will reassign that server’s work to another worker. After this reassignment each of the reduce workers will be notified to ignore the failed server and instead get the interim data from the newly assigned server.
This is a contrived example.
That’s billion with a ‘B’. In Canada that’s 1,000 million.
There is an obvious hole in this pseudo code: the first two messages of the conversation are not included in the conversational totals. I could have accommodated that but I left it out to keep the example as simple as possible.
