an introduction to pinot
Jean-François Im <jfim@linkedin.com>
2016-01-04 Tue
Intro to Pinot (2016-01-04)

A short introduction to LinkedIn's Pinot (http://github.com/linkedin/pinot)

Published in: Software

  1. an introduction to pinot
     Jean-François Im <jfim@linkedin.com>
     2016-01-04 Tue
  2. outline
     Introduction
     When to use Pinot?
     An overview of the Pinot architecture
     Managing Data in Pinot
     Data storage
     Realtime data in Pinot
     Retention
     Conclusion
  3. introduction
  4. what is pinot?
     ∙ Distributed near-realtime OLAP datastore
     ∙ Used at LinkedIn for various user-facing (“Who viewed my profile,” publisher analytics, etc.), client-facing (ad campaign creation and tracking) and internal analytics (XLNT, EasyBI, Raptor, etc.)
  5. what is pinot
     ∙ Offers a SQL query interface on top of a custom-written data store
     ∙ Offers near-realtime ingestion of events from Kafka (a few seconds latency at most)
     ∙ Supports pushing data from Hadoop
     ∙ Can combine data from Hadoop and Kafka at runtime
     ∙ Scales horizontally and linearly if data size or query rate increases
     ∙ Fault tolerant (any component can fail without causing availability issues, no single point of failure)
     ∙ Automatic data expiration
  6. example of queries
       SELECT
         weeksSinceEpochSunday,
         distinctCount(viewerId)
       FROM mirrorProfileViewEvents
       WHERE vieweeId = ...
         AND (viewerPrivacySetting = 'F' OR ... OR viewerPrivacySetting = '')
         AND daysSinceEpoch >= 16624
         AND daysSinceEpoch <= 16714
       GROUP BY weeksSinceEpochSunday
       TOP 20
       LIMIT 0
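The daysSinceEpoch bounds in the query above are easier to read as calendar dates. A minimal Python sketch, assuming daysSinceEpoch counts whole days since the Unix epoch (1970-01-01); the slides do not define the column, and the helper names are hypothetical:

```python
from datetime import date, timedelta

# Assumption: daysSinceEpoch is the number of whole days since 1970-01-01.
EPOCH = date(1970, 1, 1)


def days_since_epoch(d: date) -> int:
    """Convert a calendar date to a day count relative to the Unix epoch."""
    return (d - EPOCH).days


def date_from_days(days: int) -> date:
    """Convert a day count relative to the Unix epoch back to a date."""
    return EPOCH + timedelta(days=days)


start = date_from_days(16624)
end = date_from_days(16714)
print(start, end, (end - start).days)  # prints: 2015-07-08 2015-10-06 90
```

Under that assumption the query covers a 90-day window.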
  7. example of queries
  8. how does “who viewed my profile” work?
  9. usage of pinot at linkedin
     ∙ Over 50 use cases at LinkedIn
     ∙ Several thousand queries per second across multiple data centers
     ∙ Operates 24x7, exposes metrics for production monitoring
     ∙ The internal de facto solution for scalable data querying
  10. when to use pinot?
  11. design limitations
     ∙ Pinot is designed for analytical workloads (OLAP), not transactional ones (OLTP)
     ∙ Data in Pinot is immutable (e.g. no UPDATE statement), though it can be overwritten in bulk
     ∙ Realtime data is append-only (can only load new rows)
     ∙ There is no support for JOINs or subselects
     ∙ There are no UDFs for aggregation (work in progress)
  12. when to use pinot?
     ∙ When you have an analytics problem (How many of “x” happened?)
     ∙ When you have many queries per day and require low query latency (otherwise use Hadoop for one-time ad hoc queries)
     ∙ When you can’t pre-aggregate data to be stored in some other storage system (otherwise use Voldemort or an OLAP cubing solution)
  13. an overview of the pinot architecture
  14. controller, broker and server
     ∙ There are three components in Pinot: Controller, broker and server
     ∙ Controller: Handles cluster-wide coordination using Apache Helix and Apache Zookeeper
     ∙ Broker: Handles query fan out and query routing to servers
     ∙ Server: Responds to query requests originating from the brokers
  15. controller, broker and server
  16. controller, broker and server
     ∙ All of these components are redundant, so there is no single point of failure by design
     ∙ Uses Zookeeper as a coordination mechanism
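The broker/server split described above is a scatter-gather pattern: the broker fans a query out to the servers and merges their partial results. A toy Python sketch of that idea, not Pinot's actual code; the Server and Broker classes and the count_by method are hypothetical, and the controller's coordination role is left out:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Server:
    """Hypothetical server: holds segments (lists of row dicts) and answers queries."""

    def __init__(self, segments):
        self.segments = segments

    def count_by(self, column):
        # Compute a partial result over the segments this server holds.
        partial = Counter()
        for segment in self.segments:
            for row in segment:
                partial[row[column]] += 1
        return partial


class Broker:
    """Hypothetical broker: fans the query out and merges partial results."""

    def __init__(self, servers):
        self.servers = servers

    def count_by(self, column):
        # Scatter: send the same query to every server in parallel.
        with ThreadPoolExecutor(max_workers=len(self.servers)) as pool:
            partials = list(pool.map(lambda s: s.count_by(column), self.servers))
        # Gather: merge the per-server counts into one answer.
        total = Counter()
        for partial in partials:
            total.update(partial)
        return dict(total)


servers = [
    Server([[{"country": "us"}, {"country": "ca"}]]),
    Server([[{"country": "us"}], [{"country": "fr"}]]),
]
broker = Broker(servers)
print(broker.count_by("country"))
```

A count merges trivially by addition; an aggregation such as distinctCount would need each server to return a mergeable intermediate (for example a set) rather than a final number.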
  17. managing data in pinot
  18. getting data into pinot
     ∙ Let’s first look at the offline case. We have data in Hadoop that we would like to get into Pinot.
  19. getting data into pinot
     ∙ Data in Pinot is packaged into segments, which contain a set of rows
     ∙ These are then uploaded into Pinot
  20. getting data into pinot
     ∙ A segment is a pre-built index over this set of rows
     ∙ Data in Pinot is stored in columnar format (we’ll get to this later)
     ∙ Each input Avro file maps to one Pinot segment
  21. getting data into pinot
     ∙ Each segment file that is generated contains both the minimum and maximum timestamp contained in the data
     ∙ Each segment file also has a sequential number appended to the end
       ∙ mirrorProfileViewEvents_2015-10-04_2015-10-04_0
       ∙ mirrorProfileViewEvents_2015-10-04_2015-10-04_1
       ∙ mirrorProfileViewEvents_2015-10-04_2015-10-04_2
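The naming pattern in the examples above (table name, minimum date, maximum date, sequence number) can be sketched in Python as follows. The segment_name function is hypothetical and only mirrors the names shown on the slide; the real segment generation job may build names differently:

```python
from datetime import date


def segment_name(table: str, min_day: date, max_day: date, sequence: int) -> str:
    """Build a name shaped like the slide's examples: table_min_max_sequence."""
    return f"{table}_{min_day.isoformat()}_{max_day.isoformat()}_{sequence}"


day = date(2015, 10, 4)
for sequence in range(3):
    print(segment_name("mirrorProfileViewEvents", day, day, sequence))
```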
  22. getting data into pinot
     ∙ Data uploaded into Pinot is stored on a segment basis
     ∙ Uploading a segment with the same name overwrites the data that currently exists in that segment
     ∙ This is the only way to update data in Pinot
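The overwrite-by-name behaviour described above can be modelled as a mapping from segment name to rows. A minimal Python sketch; SegmentStore is a hypothetical in-memory stand-in, not Pinot's storage layer:

```python
class SegmentStore:
    """Hypothetical in-memory model of segment-level storage."""

    def __init__(self):
        self._segments = {}

    def upload(self, name, rows):
        """Store a segment; returns True if an existing segment was overwritten."""
        replaced = name in self._segments
        # The whole segment is replaced at once; individual rows are never edited.
        self._segments[name] = list(rows)
        return replaced

    def rows(self):
        """All rows currently visible, in segment-name order."""
        return [row for name in sorted(self._segments) for row in self._segments[name]]


store = SegmentStore()
store.upload("events_2015-10-04_2015-10-04_0", [{"viewerId": 1}, {"viewerId": 2}])
# Re-uploading under the same name is the "bulk update": the old rows disappear.
store.upload("events_2015-10-04_2015-10-04_0", [{"viewerId": 3}])
print(store.rows())  # prints: [{'viewerId': 3}]
```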
  23. data storage
  24. data orientation: rows and columns
     ∙ Most OLTP databases store data in a row-oriented format
     ∙ Pinot stores its data in a column-oriented format
     ∙ If you have heard the terms array of structures (AoS) and structure of arrays (SoA), this is the same idea
  25. data orientation: rows and columns
  26. benefits of column-orientation
     ∙ Queries only read the data they need (columns not used in a query are not read)
     ∙ Individual row lookups are slower, aggregations are faster
     ∙ Compression can be a lot more effective, as related data is packed together
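The row versus column trade-off above can be illustrated with a small Python sketch. The sample rows and the helper names are made up for illustration; only the AoS/SoA idea comes from the slides:

```python
# Row-oriented (array of structures): one record per row.
rows = [
    {"viewerId": 1, "country": "us", "daysSinceEpoch": 16624},
    {"viewerId": 2, "country": "ca", "daysSinceEpoch": 16624},
    {"viewerId": 1, "country": "us", "daysSinceEpoch": 16625},
]

# Column-oriented (structure of arrays): one array per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def distinct_count(columns, name):
    """An aggregation reads only the column it needs."""
    return len(set(columns[name]))


def get_row(columns, index):
    """A row lookup has to touch every column, which is the slower case."""
    return {name: values[index] for name, values in columns.items()}


print(distinct_count(columns, "viewerId"))  # prints: 2
print(get_row(columns, 1))
```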
  27. a couple of tricks
     ∙ Pinot uses a couple of techniques to reduce data size
     ∙ Dictionary encoding allows us to deduplicate repetitive data in a single column (e.g. country, state, gender)
     ∙ Bit packing allows us to pack multiple values in the same byte/word/dword
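Both techniques above can be sketched in a few lines of Python. This is a generic illustration of dictionary encoding and bit packing, not Pinot's on-disk format; all function names are hypothetical:

```python
def dictionary_encode(values):
    """Replace each value by its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def bits_needed(cardinality):
    """Smallest bit width that can represent ids 0..cardinality-1 (at least 1)."""
    return max(1, (cardinality - 1).bit_length())


def bit_pack(ids, width):
    """Pack fixed-width ids back to back into a little-endian byte string."""
    packed = 0
    for position, value in enumerate(ids):
        packed |= value << (position * width)
    return packed.to_bytes((len(ids) * width + 7) // 8, "little")


def bit_unpack(data, width, count):
    """Inverse of bit_pack: recover `count` ids of `width` bits each."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


countries = ["us", "ca", "us", "fr", "us", "ca"]
dictionary, ids = dictionary_encode(countries)
width = bits_needed(len(dictionary))
packed = bit_pack(ids, width)
print(dictionary, ids, width, len(packed))  # prints: ['ca', 'fr', 'us'] [2, 0, 2, 1, 2, 0] 2 2
```

Here six strings shrink to a three-entry dictionary plus two bytes of packed ids, since three distinct values need only two bits each.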
  28. realtime data in pinot
  29. tables: offline and realtime
     ∙ Pinot has two kinds of tables: offline and realtime
     ∙ An offline table stores data that has been pushed from Hadoop, while a realtime table sources its data from Kafka
     ∙ These two tables are disjoint and can contain the same data
  30. data ingestion
     ∙ Realtime data ingestion is done through Kafka
     ∙ In the open source release, there is a JSON decoder and an Avro decoder for messages
     ∙ This architecture allows plugging in new data ingestion sources (e.g. other message queuing systems), though at this time there are no other sources implemented
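The pluggable-decoder idea above can be sketched in Python as a lookup from decoder name to a function that turns a raw message into a row. Everything here is hypothetical (the DECODERS registry, the function names, the list standing in for a Kafka topic); it does not reflect Pinot's actual decoder interface, and no Kafka client is involved:

```python
import json


def decode_json_message(payload: bytes) -> dict:
    """Decode one UTF-8 JSON message into a row (a dict of column values)."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical registry: a new message format would plug in by adding an entry.
DECODERS = {"json": decode_json_message}


def ingest(messages, decoder_name, table_rows):
    """Append-only ingestion: decode each message and add it as a new row."""
    decode = DECODERS[decoder_name]
    for payload in messages:
        table_rows.append(decode(payload))
    return table_rows


# A plain list of bytes stands in for messages read from a stream.
stream = [b'{"viewerId": 1, "vieweeId": 42}', b'{"viewerId": 2, "vieweeId": 42}']
print(ingest(stream, "json", []))
```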
  31. hybrid querying
     ∙ Since realtime and offline tables are disjoint, how are they queried?
     ∙ If an offline and realtime table have the same name, when a broker receives a query, it rewrites it to two queries, one for the offline and one for the realtime table
  32. hybrid querying
     ∙ Data is partitioned according to a time column, with a preference given to offline data
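One way to picture the time partitioning above is a boundary on the time column: the offline table answers up to the boundary and the realtime table answers after it. The Python sketch below assumes the boundary is the latest time value present in the offline data; the slides only say that offline data is preferred, so the exact rule and the function names are assumptions:

```python
def time_boundary(offline_rows, time_column):
    """Assumed rule: the boundary is the latest time value held by the offline table."""
    return max(row[time_column] for row in offline_rows)


def hybrid_rows(offline_rows, realtime_rows, time_column):
    """Combine both tables without double counting the overlapping time range."""
    if not offline_rows:
        return list(realtime_rows)
    boundary = time_boundary(offline_rows, time_column)
    # Sub-query 1: the offline table answers for time <= boundary (preferred source).
    from_offline = [r for r in offline_rows if r[time_column] <= boundary]
    # Sub-query 2: the realtime table answers only for time > boundary.
    from_realtime = [r for r in realtime_rows if r[time_column] > boundary]
    return from_offline + from_realtime


offline = [{"day": 1, "src": "hadoop"}, {"day": 2, "src": "hadoop"}, {"day": 3, "src": "hadoop"}]
realtime = [{"day": 3, "src": "kafka"}, {"day": 4, "src": "kafka"}, {"day": 5, "src": "kafka"}]
print([(r["day"], r["src"]) for r in hybrid_rows(offline, realtime, "day")])
```

Day 3 exists in both tables, and only the offline copy is returned.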
  33. data
     ∙ Since there are two data sources for the same data, if there is an issue with one (e.g. Kafka/Samza issue or Hadoop cluster issue), the other one is used to answer queries
     ∙ This means that you don’t get called in the middle of the night for data-related issues and there’s a large time window for fixing issues
  34. retention
  35. retention
     ∙ Tables in Pinot can have a customizable retention period
     ∙ Segments will be expunged automatically when their last timestamp is past the retention period
     ∙ This is done by a process called the retention manager
  36. retention
     ∙ Offline and realtime tables have different retention periods. For example, “who viewed my profile?” has a realtime retention of seven days and an offline retention period of 90 days.
     ∙ This means that even if the Hadoop job doesn’t run for a couple of days, data from the realtime flow will answer the query
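The retention rule above (a segment is expunged once its last timestamp is older than the retention period) can be sketched in Python as below. The expired_segments function and the segment names are hypothetical, time is expressed in days since epoch for simplicity, and the real retention manager may use a different time unit or comparison:

```python
def expired_segments(last_day_by_segment, today, retention_days):
    """Names of segments whose last timestamp is older than the retention period."""
    cutoff = today - retention_days
    return sorted(
        name for name, last_day in last_day_by_segment.items() if last_day < cutoff
    )


today = 16714
segments = {"seg_old": 16600, "seg_week": 16705, "seg_new": 16712}

# Retention periods from the "who viewed my profile?" example: 90 and 7 days.
print(expired_segments(segments, today, retention_days=90))  # prints: ['seg_old']
print(expired_segments(segments, today, retention_days=7))   # prints: ['seg_old', 'seg_week']
```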
  37. conclusion
  38. conclusion
     ∙ Pinot is a realtime distributed analytical data store that can handle interactive analytical queries running on large amounts of data
     ∙ It’s used for various internal and external use cases at LinkedIn
     ∙ It’s open source! (github.com/linkedin/pinot)
     ∙ Ping me if you want to deploy it; I’ll help you out
