Processing pipeline requirements change all the time, but porting a meaningful insight from Batch to Near Real Time can involve a lot of rework. In this introductory talk we will present the Beam Model and how it is implemented in Apache Beam, with specific examples meant to be used as recipes in your next developments. Finally, we will showcase how to switch the underlying engine without changing a line of code.
Bio
Marc Gonzalez is a Freelance Data Engineer, associated with InnoIT Consulting. He has 5+ years of experience applying Big Data technologies to the classifieds world.
Presented at https://www.meetup.com/Barcelona-Apache-Beam-Meetup/events/266806458/
Do you want this talk privately delivered to your company? Visit https://hire.marcgonzalez.eu
2. InnoIT: a disruptive IT company!
INNOIT IS A CONSULTING COMPANY SPECIALISED IN IT
We work with:
Web & Mobile Development
Systems & DevOps
Quality Assurance Testing
Big Data & Machine Learning
Methodologies: Agile, Lean, Product Owner
We are specialists!
9 YEARS IN FRANCE, 3 YEARS IN SPAIN:
In France, we are > 200 consultants
In Spain, we have built a team of > 35 consultants in 3 years
We have earned the trust of more than 25 multinational clients
We have organized >20 technological meetups with >900 attendees
3. We are simply different
COACHING OF OUR CONSULTANTS AS A “FOOTBALL AGENT SCOUT”: WE WORK “FROM THE CANDIDATE TO THE MARKET”.
OUR CLIENT IS THE CANDIDATE.
SOFT SKILLS: WE LOOK FOR THE BEST TALENT WE CAN TRUST. WE FOCUS ON A PERSON'S POTENTIAL
TRANSPARENCY: WE ONLY PROMISE WHAT WE ARE GOING TO DELIVER, AND WE KEEP OUR PROMISES
WE WON THE TRUST OF THE CANDIDATES:
1 IN 2 CVS WE RECEIVE COMES FROM A “REFERRAL”
THE SAME EFFECT WITH OUR CLIENTS
OUR CONSULTANTS GIVE US REVIEWS & QUOTES ON SOCIAL MEDIA
5. We are hiring people like you!
YOU CAN TAKE A LOOK AT OUR OFFERS AND APPLY ON OUR WEBSITE:
WWW.INNO-IT.ES
YOU CAN ALSO SEND YOUR CV TO THE EMAIL:
APPLY@INNO-IT.ES
YOU CAN ALSO SIMPLY COME AND SPEAK WITH US ☺
MONIKA, MELISSA, GABRIEL AND I CAN EXPLAIN OUR OPPORTUNITIES TO YOU!
6. Next Event in InnoIT
JANUARY
MEETUP: “INTRODUCTION TO APACHE BEAM. APACHE BEAM AND HOW TO LEVERAGE UNIFIED PROCESSING TO TACKLE NEW DEVELOPMENTS”
WEDNESDAY 29/01, 19H TO 21H (URQUINAONA OFFICE)
ORGANIZED BY INNOIT CONSULTING & UBEEQO / EUROPCAR NEW MOBILITIES
FEBRUARY
WORKING BREAKFAST: “LEADERSHIP: HOW TO MANAGE CONFLICTS IN FAST-GROWING COMPANIES?”
TUESDAY 11/02, 9H TO 11H (POBLE NOU OFFICE)
ORGANIZED BY INNOIT AND THE COACH (EX-CTO) ALFONS FOUBERT
14. Q? sli.do #BEAM
Challenges
• Processing out-of-order data based on application timestamps (also called event time)
• Maintaining large amounts of state
• Supporting high-data throughput
• Processing each event exactly once despite machine failures
• Handling load imbalance and stragglers
• Responding to events at low latency
• Joining with external data in other storage systems
• Determining how to update output sinks as new events arrive
• Writing data transactionally to output systems
• Updating your application’s business logic at runtime
15. TALK STRUCTURE
• Part 1: Beam model through Streams & Tables theory.
• Part 2: Getting started with Apache Beam.
• Part 3: Apache Beam Barcelona meetup.
• Q&A: sli.do code BEAM
16. Notes
• Most material is from Tyler Akidau, either from his blog, talks or book.
17. TALK STRUCTURE
• Part 1: Beam model through Streams & Tables theory.
• Part 2: Getting started with Apache Beam.
• Part 3: Apache Beam Barcelona meetup.
18. Beam model
• What results are calculated?
• Where in event time are results calculated?
• When in processing time are results materialized?
• How do refinements of results relate?
23. Beam model
• What results are calculated? Insights
• Where in event time are results calculated?
• When in processing time are results materialized?
• How do refinements of results relate?
24. Where in event time are results calculated?
Windowing
• Partitioning a data set along temporal boundaries.
• Window types, laid out along the Event-Time axis: Fixed, Sliding, Session.
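The three window types can be sketched in plain Python to show how an event timestamp maps to windows. This is only an illustration of the idea, not the Beam SDK: the function names, the integer timestamps, and the half-open `[start, end)` convention are assumptions made for the example.

```python
def fixed_windows(ts, size):
    """Return the single [start, end) fixed window that contains event time ts."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts, size, period):
    """Return every [start, end) window of length `size`, opened every `period`, that contains ts."""
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - size:  # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps, gap):
    """Merge the event times of one key into sessions separated by `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Still inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            # Too far from the previous event: open a new session.
            sessions.append((ts, ts + gap))
    return sessions


if __name__ == "__main__":
    print(fixed_windows(65, size=60))
    print(sliding_windows(65, size=60, period=30))
    print(session_windows([1, 3, 20, 22, 4], gap=5))
```

A fixed window assigns each event to exactly one window, a sliding window may assign it to several overlapping ones, and a session window depends on the other events of the same key, which is why it takes the whole list of timestamps.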
26. Beam model
• What results are calculated? Insights
• Where in event time are results calculated? Windowing
• When in processing time are results materialized?
• How do refinements of results relate?
27. When in processing time are results materialized?
Triggers
• Mechanism for declaring when the output for a window should be materialized (relative to some external signal).
• Per element
• Window completion
• Fixed
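The trigger kinds listed above can be compared with a small plain-Python sketch that sums the values of a single window and records when a result is materialized. It is not the Beam trigger API. The function names are hypothetical, and reading "Fixed" as "fire at a fixed processing-time interval" is an assumption.

```python
def per_element(arrivals):
    """Fire after every element: one pane per arrival, carrying the running sum."""
    total, panes = 0, []
    for processing_time, value in arrivals:
        total += value
        panes.append((processing_time, total))
    return panes


def on_window_completion(arrivals, complete_at):
    """Fire once, when the window is considered complete at processing time `complete_at`."""
    total = sum(value for processing_time, value in arrivals if processing_time <= complete_at)
    return [(complete_at, total)]


def fixed_interval(arrivals, interval, until):
    """Fire every `interval` units of processing time up to `until` (assumed meaning of "Fixed")."""
    panes = []
    for tick in range(interval, until + 1, interval):
        total = sum(value for processing_time, value in arrivals if processing_time <= tick)
        panes.append((tick, total))
    return panes


if __name__ == "__main__":
    # (processing_time, value) pairs that all belong to the same window.
    arrivals = [(1, 5), (4, 7), (12, 3)]
    print(per_element(arrivals))
    print(on_window_completion(arrivals, complete_at=15))
    print(fixed_interval(arrivals, interval=5, until=15))
```

Per-element firing gives the lowest latency at the cost of the most output, window completion gives a single result but only at the end, and a fixed interval sits in between.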
29. Beam model
• What results are calculated? Insights
• Where in event time are results calculated? Windowing
• When in processing time are results materialized? Triggers
• How do refinements of results relate?
30. How do refinements of results relate?
State
• Amount of context stored between runs.
31. How do refinements of results relate?
Watermarks
• Temporal notions of input completeness in the event-time domain.
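A watermark can be pictured as a moving lower bound on the event times still expected. The sketch below uses one simple heuristic, "largest event time seen minus an allowed delay", which is an assumption chosen for illustration and not how any particular runner computes its watermark. The class name `HeuristicWatermark` is hypothetical.

```python
class HeuristicWatermark:
    """Estimate input completeness as (max event time seen) - (allowed out-of-order delay)."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_seen = float("-inf")

    def current(self):
        """Event time before which the input is assumed to be complete."""
        return self.max_seen - self.max_delay

    def observe(self, event_time):
        """Return True if the event is late (behind the watermark), then advance the watermark."""
        late = event_time < self.current()
        self.max_seen = max(self.max_seen, event_time)
        return late


if __name__ == "__main__":
    watermark = HeuristicWatermark(max_delay=2)
    for event_time in [1, 5, 4, 12, 2]:
        late = watermark.observe(event_time)
        print(event_time, "late" if late else "on time", "watermark =", watermark.current())
```

With this heuristic, a window covering event times `[0, 10)` would be considered complete once `current()` reaches 10, and the final event with event time 2 is flagged as late because it arrives after that point.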
33. How do refinements of results relate?
Handling late data
• Firing functions when events are observed outside the state.
Technique | Side-effect
Discarding | Approximate
Accumulation | Duplicates
Accumulation & Retraction | Late updates
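The difference between the three techniques shows up in what each firing of a window emits. The following plain-Python sketch assumes a window that sums its inputs and fires several times (for example once on time and then again for late data). The function name and the mode strings are hypothetical, not Beam identifiers.

```python
def emit_panes(pane_inputs, mode):
    """Return what each trigger firing emits for one window that sums its inputs."""
    emitted = []
    total = 0  # running sum across all firings so far
    for index, values in enumerate(pane_inputs):
        delta = sum(values)
        previous, total = total, total + delta
        if mode == "discarding":
            # Only the values that arrived since the last firing.
            emitted.append([delta])
        elif mode == "accumulating":
            # The full result so far, repeated and refined on every firing.
            emitted.append([total])
        elif mode == "accumulating_and_retracting":
            # The full result so far, preceded by a retraction of the previous one.
            emitted.append([total] if index == 0 else [-previous, total])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return emitted


if __name__ == "__main__":
    # One on-time firing followed by two late firings.
    panes = [[5, 7], [3], [8]]
    for mode in ("discarding", "accumulating", "accumulating_and_retracting"):
        print(mode, emit_panes(panes, mode))
```

If a downstream consumer simply adds up everything it receives, discarding and accumulating-and-retracting both end at the true total of 23, while plain accumulation adds up to 50 because earlier results are counted again.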
35. Beam model
• What results are calculated? Insights
• Where in event time are results calculated? Windowing
• When in processing time are results materialized? Triggers
• How do refinements of results relate? Watermarks & Exactly Once
36. Streams & Tables theory
“Every Stream can yield a Table at a certain time, & every Table can be observed into a Stream.”
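Both directions of the Streams & Tables relationship can be sketched with a per-key sum in plain Python. The `(time, key, value)` event layout and the function names are assumptions made for the example.

```python
def table_at(stream, time):
    """Stream -> Table: snapshot of the per-key sums of all events up to `time`."""
    table = {}
    for event_time, key, value in stream:
        if event_time <= time:
            table[key] = table.get(key, 0) + value
    return table


def changelog(stream):
    """Table -> Stream: observe the evolving table as (time, key, new_total) updates."""
    table = {}
    for event_time, key, value in sorted(stream):
        table[key] = table.get(key, 0) + value
        yield (event_time, key, table[key])


if __name__ == "__main__":
    stream = [(1, "a", 2), (2, "b", 1), (3, "a", 5)]
    print(table_at(stream, time=2))
    print(table_at(stream, time=3))
    print(list(changelog(stream)))
```

Grouping puts data to rest in a table, and observing how that table changes over time turns it back into a stream.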
43. Batch+strEAM model
• What results are calculated? Insights
• Where in event time are results calculated? Windowing
• When in processing time are results materialized? Triggers
• How do refinements of results relate? Watermarks & Exactly Once
44. Recap Part 1
• The Beam model is useful for processing Bounded & Unbounded Tables.
• Event vs Processing time & how it relates to Windowing and Triggering.
• Stateful processing is useful when working to guarantee correctness.
• State is managed with Watermarks, Late Data firings & Fault Tolerant Exactly Once semantics.
45. TALK STRUCTURE
• Part 1: Beam model through Streams & Tables theory.
• Part 2: Getting started with Apache Beam.
• Part 3: Apache Beam Barcelona meetup.
46. Apache Beam
• Unified model
• Multiple languages, including SQL
• Portable runners!
48. PCollection
• Distributed collection (inspired by RDDs and DataFrames) that can be bounded or unbounded.
• Source Readers
• Sink Writers
51. ParDo
• ParDo applies a DoFn in a distributed fashion.
• DoFns are User Defined Functions, which must be:
• Serializable
• Thread safe
• Idempotent
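Why these requirements matter can be seen in a toy, single-process imitation of ParDo. This is not the Beam SDK: `SplitWords` and `par_do` are hypothetical names, and `copy.deepcopy` only stands in for the serialization round trip that would ship the function to real workers.

```python
import copy


class SplitWords:
    """A DoFn-like class (hypothetical): one input element in, zero or more elements out."""

    def process(self, element):
        # No shared mutable state, so the same input always yields the same output.
        for word in element.split():
            yield word.lower()


def par_do(do_fn, elements, bundle_size=2):
    """Apply do_fn to every element, bundle by bundle, as independent workers might."""
    output = []
    for start in range(0, len(elements), bundle_size):
        # Each bundle works on its own copy of the function, as if it had been
        # serialized and shipped to a different worker.
        worker_fn = copy.deepcopy(do_fn)
        for element in elements[start:start + bundle_size]:
            output.extend(worker_fn.process(element))
    return output


if __name__ == "__main__":
    lines = ["Hello Beam", "unified MODEL", "Batch"]
    first_attempt = par_do(SplitWords(), lines)
    retry = par_do(SplitWords(), lines)  # a retried run must produce the same result
    print(first_attempt)
    print(first_attempt == retry)
```

Because a bundle can be retried after a failure, a function that kept counters or wrote to an external system inside `process` could produce different or duplicated results on the second attempt.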
52. High-level PTransforms
• Element-wise: Filter, FlatMapElements, Keys, KvSwap, MapElements, ParDo, Partition, Regex, Reify, ToString, WithKeys, WithTimestamps, Values
• Aggregation: ApproximateQuantiles, ApproximateUnique, CoGroupByKey, Combine, CombineWithContext, Count, Distinct, GroupByKey, GroupIntoBatches, HllCount, Latest, Max, Mean, Min, Sample, Sum, Top
• Other: Create, Flatten, PAssert, View, Window
55. Recap Part 2
• Beam pipelines are combinations of PCollections + PTransforms.
• Many I/O transforms are available out of the box.
• Separate your Readers and Writers for reusability & testability.
• Use high-level Transforms for a jump start.
• Locate your problem in the model (what, where, when, how) to design complex transforms well.
• Language & Runner independent FTW
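The advice about separating Readers and Writers can be illustrated without any framework: keep the transformation a pure function and inject the I/O around it. The names `count_words` and `run` are hypothetical, and in a real Beam pipeline the reader and writer would be I/O transforms rather than plain callables.

```python
def count_words(lines):
    """Core logic only: no knowledge of where lines come from or where counts go."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def run(read, transform, write):
    """Wire a reader, a transform and a writer together."""
    write(transform(read()))


if __name__ == "__main__":
    # In a test, the reader and writer are simple in-memory stand-ins.
    sink = []
    run(read=lambda: ["beam beam model", "model"], transform=count_words, write=sink.append)
    print(sink)
```

The same `count_words` can then be reused behind a file reader in batch or a message-queue reader in streaming, and tested with neither.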
56. TALK STRUCTURE
• Part 1: Beam model through Streams & Tables theory.
• Part 2: Getting started with Apache Beam.
• Part 3: Apache Beam Barcelona meetup.