Pig is an open-source dataflow system that allows users to analyze large datasets through a high-level language called Pig Latin. It sits on top of Hadoop and compiles Pig Latin queries into MapReduce jobs. Pig provides simple operations for data manipulation like filtering, grouping, joining, and generating new columns. It is commonly used by companies like Yahoo, Twitter, and LinkedIn to process web logs, build user behavior models, and perform other large-scale data analysis tasks.
2. What is Pig:
• Pig is an open-source, high-level dataflow system.
• It provides a simple language for queries and data manipulation called Pig Latin.
• Internally, Pig Latin scripts are compiled into Map-Reduce jobs that run on Hadoop.
• It is similar to an SQL query in that the user specifies the "What" and leaves the "How" to the underlying processing engine.
3. Pig in the Hadoop Ecosystem:
Pig sits on top of the Map-Reduce layer.
4. Pig v/s Map-Reduce:
• Map-Reduce is compiled (Java) code; Pig is a scripting language.
• Map-Reduce requires Java knowledge; Pig does not, except perhaps to write your own UDFs.
• Map-Reduce involves lots of hand coding; Pig uses predefined SQL-like functions or extends existing UDFs.
• Users are much more comfortable using Map-Reduce when dealing with totally unstructured data; Pig has trouble dealing with unstructured data like images, videos, etc.
5. Who is using Pig:
• 70% of production jobs at Yahoo (tens of thousands per day)
• Yahoo, Twitter, LinkedIn, Ebay, AOL, …
• Used to:
– Process web logs
– Build user behavior models
– Build maps of the web
– Do research on large data sets
6. Accessing Pig:
• There are two modes in which we can access Pig:
1) Local Mode: To run Pig in local mode, you need access to a single machine.
2) Hadoop (Map-Reduce) Mode: To run Pig in Hadoop (map-reduce) mode, you need access to a Hadoop cluster and an HDFS installation.
7. Ways to Run Pig:
• Grunt Shell: Enter Pig commands manually using Pig's interactive shell, Grunt.
  e.g.: $ pig -x <local or mapreduce>
        grunt>
• Script File: Place Pig commands in a script file and run the script, as shown in the sketch below.
  e.g.: $ pig -x <local or mapreduce> my_script.pig
• Embedded Program: Embed Pig commands in a host language and run the program.
  e.g.: $ java -cp pig.jar:. idlocal
        $ java -cp pig.jar:.:$HADOOPDIR idhadoop
Note: The '-x mapreduce' option may be omitted when running in Hadoop mode: $ pig -x mapreduce is the same as $ pig, and $ pig -x mapreduce my_script.pig is the same as $ pig my_script.pig.
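As a concrete sketch of the script-file approach, the word-count script from slide 28 can be saved as my_script.pig and run in either mode (the input and output paths here are only illustrative):

-- my_script.pig
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
STORE wordcount INTO '/user/hadoop/wordcount_out';

Run it with $ pig my_script.pig (Hadoop mode) or $ pig -x local my_script.pig (local mode).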
8. Data Types:
Simple Types:
• int       : Signed 32-bit integer. Example: 10
• long      : Signed 64-bit integer. Data: 10L or 10l; displayed as 10L
• float     : 32-bit floating point. Data: 10.5F, 10.5f, 10.5e2f or 10.5E2F; displayed as 10.5F or 1050.0F
• double    : 64-bit floating point. Data: 10.5, 10.5e2 or 10.5E2; displayed as 10.5 or 1050.0
• chararray : Character array (string) in Unicode UTF-8 format. Example: hello world
• bytearray : Byte array (blob).
• boolean   : Boolean true/false (case insensitive).
Complex Types:
• tuple : An ordered set of fields. Example: (19,2)
• bag   : A collection of tuples. Example: {(19,2),(18,1)}
• map   : A set of key-value pairs. Example: [name#John,phone#5551212]
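As a small illustration of how these types combine in a LOAD schema (the file name 'students' and its fields are hypothetical, following the schema syntax shown on slide 10):

S = LOAD 'students' AS (name:chararray,
                        grades:bag{t:tuple(course:chararray, gpa:float)},
                        contact:map[chararray]);
X = FOREACH S GENERATE name, contact#'phone';  -- '#' looks up a key in a map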
9. Pig Execution:
• Pig scripts/commands follow the pattern given below:
  Load (Text, CSV, JSON, Hive table) --> Transform (Filter, Group, Sort) --> Store (Dump, Store into HDFS, Hive)
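A minimal end-to-end sketch of this Load -> Transform -> Store pattern, reusing the 'student' file and schema from the later slides (the output path is illustrative):

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);  -- Load
B = FILTER A BY gpa >= 3.5;                                                     -- Transform
STORE B INTO 'good_students' USING PigStorage(',');                             -- Store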
10. Loading Data in Pig:
• A = LOAD 'student';
• file_load = LOAD '/usr/tmp/student.txt';
• Z = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
• A = LOAD 'data' AS (f1:int, f2:int, B:bag{T:tuple(t1:int, t2:int)});
-- A / file_load / Z here are called Relations.
-- LOAD is the keyword used for loading data from HDFS into a Relation for processing/transformation.
-- 'student' is the name of the file or directory, in single quotes. We can give a full path name, or file_name* to load all files with similar names.
-- USING is a keyword.
-- PigStorage() / TextLoader() / JsonLoader() / HCatLoader(): we need to use the appropriate function for Pig to understand the incoming data. These are case sensitive.
   PigStorage() defaults to TAB-separated data. If the separator is different, we need to specify it between the parentheses, e.g. PigStorage(','), PigStorage('\t').
-- AS is a keyword.
-- (name:chararray, ...) is called the Schema.
11. Accessing the Relation:
• Once the data is loaded into a Relation, there are two ways we can access the data:
(1) By position
(2) By schema names
In the first example above (relation A), the columns need to be accessed by position since no schema was defined. The notation starts with $0 for the first column, $1 for the second column, and so on.
In the next example (relation Z), the schema is defined in terms of column names. We can use either the $0, $1 notation or the column names as-is.
grunt> DESCRIBE A;
-- Does not produce any output since A is schema-less.
grunt> DESCRIBE Z;
Z: {name: chararray, age: int, gpa: float}
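A short sketch of the two access styles, using relations A (schema-less) and Z (with schema) from slide 10:

grunt> A1 = FOREACH A GENERATE $0, $2;     -- positional only: A has no schema
grunt> Z1 = FOREACH Z GENERATE name, gpa;  -- by column name
grunt> Z2 = FOREACH Z GENERATE $0, $2;     -- positional notation also works on Z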
13. Relational Operators in Pig:
• The most important ones to know are:
1) FILTER,
2) GROUP BY/ COGROUP BY,
3) LIMIT,
4) ORDER BY,
5) JOIN,
6) DISTINCT,
7) FOREACH GENERATE.
14. Relational Operator Examples:
-- Filter: similar to the WHERE clause in SQL.
grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> X = FILTER A BY f3 == 3;
grunt> Y = FILTER A BY (f1 == 8) OR (f2 == 10);
-- Group By / CoGroup By: GROUP BY is used for grouping tuples within a single relation, whereas COGROUP BY is used when we need to group two or more relations.
grunt> A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
grunt> B = GROUP A BY age;
grunt> A = LOAD 'data1' AS (owner : chararray, pet : chararray);
grunt> DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
grunt> B = LOAD 'data2' AS (friend1 : chararray, friend2 : chararray);
grunt> DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
grunt> X = COGROUP A BY owner, B BY friend2;
Output:
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
15. Join:
-- Join: Essentially, the GROUP and JOIN operators perform similar functions: GROUP creates a nested set of output tuples, while JOIN creates a flat set of output tuples.
• Types of Joins: Inner, Outer (Left, Right, Full), Replicated, Merge, Skewed.
• Examples:
grunt> A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
grunt> X = JOIN A BY a1, B BY b1;
grunt> DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9) …
16. FOREACH .. GENERATE:
• Generates data transformations based on columns of data.
• It generally follows a Join, Group, or Filter operator, or a Load if you want to work with only a select few columns.
• Example:
grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> DUMP A;
grunt> Y = FOREACH A GENERATE *;  -- prints Relation A as-is, with all columns.
grunt> B = GROUP A BY f1;
grunt> DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> X = FOREACH B GENERATE group, COUNT(A) AS total;
(1,1)
(4,2)
(7,1)
(8,2)
Note: Here 'group' is the first column of the grouped output and is named implicitly by Pig; it holds the values 1, 4, 7 and 8.
17. Limit:
-- Limit: Limits the number of output tuples. If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, all tuples in the relation are returned.
Example:
grunt> X = LIMIT A 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Note: For Top-N analysis, use ORDER BY (asc or desc) and then LIMIT the output, as sketched below.
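A minimal Top-N sketch combining the two operators, using positional notation so it works with or without a schema:

grunt> sorted = ORDER A BY $2 DESC;  -- sort on the third column, descending
grunt> top3 = LIMIT sorted 3;
grunt> DUMP top3;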
18. Distinct / Order By:
-- Distinct: Removes duplicate tuples in a relation.
grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
grunt> X = DISTINCT A;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
-- Order By: Sorts a relation based on one or more fields. ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the order in which these records are returned is not defined and is not guaranteed to be the same from one run to the next.
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = ORDER A BY a3 DESC;
grunt> DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
19. Arithmetic Operators in Pig:
• Pig supports the standard arithmetic operators. They are:
1) Addition (+)
2) Subtraction (-)
3) Multiplication (*)
4) Division (/)
5) Modulo (%)
6) Bincond (? :) [(condition ? value_if_true : value_if_false)]
7) Case (CASE WHEN THEN ELSE END)
20. Examples:
grunt> X = FOREACH A GENERATE f1, f2, f1+f2 AS f4;
grunt> X = FOREACH A GENERATE f2, (f2 == 1 ? 1 : f3);
grunt> X = FOREACH A GENERATE f2,
         ( CASE WHEN f2 % 2 == 0 THEN 'even'
                WHEN f2 % 2 == 1 THEN 'odd'
           END );
• The above CASE statement can also be written as:
grunt> X = FOREACH A GENERATE f2,
         ( CASE f2 % 2 WHEN 0 THEN 'even'
                       WHEN 1 THEN 'odd'
           END );
21. Math Functions in Pig:
• ABS: Returns the absolute value of an expression.
  Example: ABS(a) for an int or float expression a.
• CEIL: Returns the value of the expression rounded up to the nearest integer.
  Example: CEIL(4.6), CEIL(1.0), CEIL(-2.4)
• FLOOR: Returns the value of the expression rounded down to the nearest integer.
  Example: FLOOR(4.6), FLOOR(1.0), FLOOR(-2.4)
• ROUND: Returns the value of an expression rounded to an integer.
  Example: ROUND(4.6), ROUND(1.0), ROUND(-2.4)
• SQRT: Returns the positive square root of an expression.
  Example: SQRT(5)
Note: Pig's built-in function names are case sensitive; use them in upper case as shown.
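A small sketch applying these functions inside FOREACH, assuming relation A with numeric fields f1 and f3 from the earlier slides:

grunt> X = FOREACH A GENERATE f1, ROUND(f3), SQRT(f1);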
22. String Functions in Pig:
• LOWER / UPPER: Converts all characters in a string to lower / upper case.
• LTRIM / RTRIM / TRIM: Returns a copy of a string with leading / trailing / both leading and trailing whitespace removed.
• SUBSTRING: Returns a substring of a given string.
  Syntax: SUBSTRING(string, startIndex, stopIndex)
  Example: SUBSTRING('ABCDEF',1,4) => 'BCD'. The start index is 0-based, and the stop index should be one past the last character we want.
• REPLACE: Replaces existing characters in a string with new characters.
  Syntax: REPLACE(string, 'oldChar', 'newChar')
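A short sketch applying these string functions to the 'student' relation used on other slides:

grunt> A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
grunt> X = FOREACH A GENERATE UPPER(name), SUBSTRING(name, 0, 1), REPLACE(name, 'John', 'Jon');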
23. Eval Functions in Pig:
• The Eval functions usually operate on the 'bag' datatype, so we need a GROUP BY before applying them.
• COUNT / COUNT_STAR: Computes the number of elements in a bag. The COUNT function ignores nulls; if you want to include NULL values in the count computation, use COUNT_STAR. The output will always be of type long. A small sketch of the difference follows the example.
Example: DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
X = FOREACH B GENERATE COUNT(A);
DUMP X;
(1L)
(2L)
(1L)
(2L)
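A hedged sketch of the COUNT vs COUNT_STAR difference; the file 'data3' and its contents are hypothetical, with some null values in column c:

grunt> R = LOAD 'data3' AS (k:int, c:int);  -- hypothetical input; empty fields load as null
grunt> G = GROUP R BY k;
grunt> X = FOREACH G GENERATE group, COUNT(R.c), COUNT_STAR(R.c);
-- COUNT skips tuples where c is null; COUNT_STAR counts them as well.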
24. MIN / MAX / AVG:
• MIN / MAX: Computes the minimum / maximum of the numeric values or chararrays in a single-column bag. In the example below, the single column is gpa.
Example:
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
25. Storing Data from Pig:
• Store functions determine how data comes out of Pig.
• PigStorage():
1. Stores data in UTF-8 format.
2. PigStorage is the default function for the STORE operator and works with both simple and complex data types.
3. PigStorage supports structured text files (in human-readable UTF-8 format).
4. The default field delimiter is tab ('\t'). You can also specify other characters as delimiters, given within single quotes.
Example: STORE X INTO 'output' USING PigStorage('*');
26. HCatStorer():
1. HCatStorer is used with Pig scripts to write data to HCatalog-managed tables (read: Hive).
2. To bring in the appropriate jars for working with HCatalog, simply include the following flag when running Pig from the shell:
   pig -useHCatalog
3. The fully qualified package name is: org.apache.hive.hcatalog.pig.HCatStorer
Example:
STORE processed_data INTO 'tablename' USING org.apache.hive.hcatalog.pig.HCatStorer();
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
Link: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
27. User Defined Functions (UDFs):
• If a requirement cannot be met by the existing operators and functions, the user has the option to write their own.
• Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing.
• Pig UDFs can currently be implemented in three languages: Java, Python, and JavaScript.
• You can customize all parts of the processing, including data load/store, column transformation, and aggregation.
• Pig also provides Piggy Bank, a repository of Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and also contribute the Java UDFs you have written.
• Please explore the Piggy Bank option before writing your own function, as someone may already have coded it. A sketch of registering and calling a UDF follows.
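A minimal sketch of how a Java UDF would be registered and invoked from Pig Latin; the jar myudfs.jar and class myudfs.ToUpper are hypothetical names:

REGISTER myudfs.jar;              -- make the hypothetical UDF jar available to Pig
DEFINE ToUpper myudfs.ToUpper();  -- optional short alias for the class
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE ToUpper(name), age;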
28. Pig Example:
Word Count in Pig:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
After TOKENIZE (each line becomes a bag of words):
({(This),(is),(a),(hadoop),(class)})
({(hadoop),(is),(a),(bigdata),(technology)})
After FLATTEN (each word becomes its own tuple):
(This)
(is)
(a)
(hadoop)
(class) ….
29. Summary:
• Pig is an open-source, high-level language.
• It sits above Map-Reduce to simplify coding.
• There are three main blocks of processing data:
  – Load
  – Transform
  – Store
• Pig can Load and Store from different sources like HDFS, Hive, etc.
• Users can write UDFs to extend the functionality.