Lily at HUG UK

presentation given 10/feb

Published in: Technology
Transcript of "Lily at HUG UK"

  1. Lily: Smart data at scale
     IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  2. big data, big problems
  3. MOORE vs data [chart labels: data, moore]
     » coping with volume + need for timeliness = parallel processing
     » data becomes business-critical = resilience through distributed architectures
     » Hadoop, MapReduce, HBase: the future data platform
  4. the CHALLENGES
     » process ALL data
     » process data in REAL-TIME
     » derive INSIGHTS
     » provide INSTANT FEEDBACK
  5. current thinking [diagram labels: data STORE, ETL, data warehouse, analytics]
     batched, off-line, overnight
  6. Step 1: store and manage all YOUR data [diagram: DATA]
  7. Step 2: store user behaviour, nearby [diagram: DATA, USER Behavior]
  8. Step 3: analyze usage patterns [diagram: DATA, USER Behavior, data processing]
  9. Step 4: add domain knowledge [diagram: DATA, USER Behavior, data processing, domain knowledge (patterns, rules, keywords, lists, ...)]
  10. Step 5: process, in real-time [diagram: DATA, USER Behavior, data processing, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
  11. Step 6: augment data [diagram: DATA, USER Behavior, data processing, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
  12. data insights [diagram: SMARTER DATA, data processing, relations, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
  13. data insights [diagram: SMARTER DATA, data processing, relations, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
      SMART DATA, at SCALE ... and in real time
  14. stories
  15. HYPER-PERSONAL recommendations (news aggregator, scale)
      [diagram labels: NEWS, TOGETHERNESS, interestingness, organisations, names, locations, brands]
  16. up-selling, CROSS-SELLING (e-retail, real-time)
      [diagram labels: product CATALOG, recommendedness, relatedness, product families, related activities, social graph]
  17. competitive innovation (IP research, insights)
      [diagram labels: patents, (dis)SIMILARITY, companies, people, materials, processes]
  18. Outerthought
      » software product company
      » scalable content applications
      » open source product portfolio
      » Java, REST, internet
      “The world is moving from content as a cost to data as an opportunity.”
  19. [diagram: data STORE = Lily 1.0 (CR); data STORE + data warehouse + analytics, real time = Lily 2.0]
  20. Lily (now)
      » Large-scale content storage, indexing and search
      » Current pilots: e-retail, mobile media, ISP, e-gov, IP research
      » up to now: 4 man-years investment (since Sept/2009)
  21. roadmap
      » now: Lily 0.3
      » Along the road:
      » april 2011 : Lily 1.0 Lily SaaS edition
      » Q3 2011
      » real-time statistics + analytics
      » Q2 2012 : Lily 2.0
      » real-time data processing engine
      » Data Insights
  22. open source
      » www.lilyproject.org
      » docs.outerthought.org/lily-docs-current/
  23. Lily Core Concepts
      » storage
          » HBase
          » repository model
          » versioning, varianting, mixins
      » indexing
          » mapping
      » search
          » SOLR
  24. falling in love with HBase: phase 1
      » automatic scaling to large data sets
      » fault-tolerance
      » flexible datamodel with sparse data
      » commodity hardware
      » efficient random access
      » community-based open source
      » Java if possible
  25. falling in love with HBase: phase 2
      » need for consistency
      » atomic single-row updates
      » M/R for index regeneration
  26. falling in love with HBase: phase 3
      » datamodel with column families and cell versioning
      » ordered tables with range scans
      » HDFS for blob storage
      » Apache
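
The HBase properties named in phase 3 (ordered tables, range scans, cell versioning) can be pictured with a toy in-memory model. This is only a sketch for illustration, not HBase client code: `ToyTable`, its methods, and the row keys and column names are all hypothetical.

```python
import bisect


class ToyTable:
    """Toy model of an ordered, versioned table (NOT the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> column name -> list of (timestamp, value)

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._keys, row)   # keep the table ordered by row key
            self._rows[row] = {}
        self._rows[row].setdefault(column, []).append((ts, value))

    def get(self, row, column, max_versions=1):
        """Return the newest cell versions first, like versioned cells."""
        cells = self._rows.get(row, {}).get(column, [])
        return sorted(cells, key=lambda c: c[0], reverse=True)[:max_versions]

    def scan(self, start, stop):
        """Return row keys in [start, stop): cheap because keys are sorted."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]


table = ToyTable()
table.put("book-002", "data:title", "Second", ts=1)
table.put("book-001", "data:title", "First", ts=1)
table.put("book-001", "data:title", "First, revised", ts=2)
table.put("author-001", "data:name", "Somebody", ts=1)

book_rows = table.scan("book-", "book.")   # all keys with the "book-" prefix
latest = table.get("book-001", "data:title")
```

Because rows are ordered by key, a prefix query becomes a contiguous range scan, and older cell versions stay retrievable next to the newest one.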
  27. Lily Repository Model
  28. Lily Datatypes
  29. Mixins
  30. Sample Lily Schema (excerpt)

      {
        namespaces: {
          /* Declaration of namespace prefixes. */
          "org.lilyproject.bookssample": "b",
          "org.lilyproject.vtag": "vtag"
        },
        fieldTypes: [
          {
            name: "b$title",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "b$pages",
            valueType: { primitive: "INTEGER" },
            scope: "versioned"
          },
          {
            name: "b$language",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "b$authors",
            valueType: { primitive: "LINK", multiValue: true },
            scope: "versioned"
          },
          {
            name: "b$name",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "b$bio",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "vtag$last",
            valueType: { primitive: "LONG" },
            scope: "non_versioned"
          }
        ],
        recordTypes: [
          {
            name: "b$Book",
            fields: [
              { name: "b$title", mandatory: true },
              { name: "b$pages", mandatory: false },
              { name: "b$language", mandatory: false },
              { name: "b$authors", mandatory: false },
              { name: "vtag$last", mandatory: false }
            ]
          },
          ...
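
To make the `mandatory` flags in the `b$Book` record type concrete, here is a minimal Python sketch that mirrors that part of the excerpt as a dictionary and reports which mandatory fields a record lacks. The helper `missing_mandatory` and the sample record are hypothetical; how Lily itself enforces mandatory fields is not shown in the slides.

```python
# The b$Book record type from the schema excerpt, as a Python dictionary.
BOOK_RECORD_TYPE = {
    "name": "b$Book",
    "fields": [
        {"name": "b$title", "mandatory": True},
        {"name": "b$pages", "mandatory": False},
        {"name": "b$language", "mandatory": False},
        {"name": "b$authors", "mandatory": False},
        {"name": "vtag$last", "mandatory": False},
    ],
}


def missing_mandatory(record_type, record_fields):
    """Hypothetical helper: names of mandatory fields absent from a record."""
    return [
        field["name"]
        for field in record_type["fields"]
        if field["mandatory"] and field["name"] not in record_fields
    ]


# A made-up record that sets only the page count.
incomplete = missing_mandatory(BOOK_RECORD_TYPE, {"b$pages": 300})
complete = missing_mandatory(BOOK_RECORD_TYPE, {"b$title": "Some Title"})
```

Only `b$title` is mandatory in the excerpt, so a record without it is the only kind this check would flag.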
  31. Lily Versioning
  32. Flexible content model
      » generic enough to accommodate many popular content schemas
          » HTML5, CMIS, RDF, NewsML, Dublin Core, ...
          » academically verified
          » not limited to ‘content applications’ only
      » developer convenience
          » higher level constructs
          » schema reuse
          » versioning, linking, ...
  33. Lily Architecture (deployment)
  34. Lily Architecture (components)
  35. HBase RowLog Library
      » need for sync/async operations
          » updating of secondary indexes (i.e. tables)
          » feeding of Indexer (= bridge to SOLR index maintenance)
      » not: transactions
      » need for distribution and durability
  36. HBase RowLog Library
      » WAL
          » guaranteed execution of synchronous actions
          » call doesn’t return before secondary action finishes
          » e.g. update secondary index tables
          » if all goes well, size = #concurrent ops
      » Queue
          » triggering of async actions
          » e.g. (re)index (updated) record with SOLR back-end
          » size depends on speed of back-end process
      » useful outside of Lily context as well!
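
The split between the WAL (synchronous actions) and the queue (asynchronous actions) can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the Lily RowLog API: `ToyRowLog`, `put`, `drain` and the two example actions are hypothetical names, and the distribution and durability that the real library targets are not modelled.

```python
from collections import deque


class ToyRowLog:
    """Toy in-memory model of the WAL + queue idea (not the Lily RowLog API)."""

    def __init__(self):
        self.wal = []             # messages whose synchronous actions are in flight
        self.queue = deque()      # messages waiting for asynchronous actions
        self.sync_actions = []    # callables run before put() returns
        self.async_actions = []   # callables run later by drain()

    def put(self, row_key, payload):
        message = (row_key, payload)
        self.wal.append(message)              # record the intent first
        for action in self.sync_actions:      # the caller waits for these
            action(row_key, payload)
        self.wal.remove(message)              # all sync actions finished
        self.queue.append(message)            # hand over to the async side

    def drain(self):
        """Run the async actions for every queued message; return the count."""
        processed = 0
        while self.queue:
            row_key, payload = self.queue.popleft()
            for action in self.async_actions:
                action(row_key, payload)
            processed += 1
        return processed


secondary_index = {}   # stands in for a secondary index table
indexed_rows = []      # stands in for documents sent to a search back-end


def update_secondary_index(row_key, payload):
    secondary_index[payload["title"]] = row_key


def feed_indexer(row_key, payload):
    indexed_rows.append(row_key)


log = ToyRowLog()
log.sync_actions.append(update_secondary_index)
log.async_actions.append(feed_indexer)
log.put("book-001", {"title": "Example Title"})
```

After `put` returns, the secondary index is already updated and the WAL is empty again, while the indexer is only fed once `log.drain()` runs. If a synchronous action raised, the message would stay in `wal`, which is where a real implementation would pick it up for replay.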
  37. The Lily Indexer
      [diagram labels: denormalization; batch index building; sharding towards multiple SOLR instances; indexing of multiple versions of a record; incremental index updating; blob content extraction]
  38. Indexing configuration (SOLR)

      <schema name="example" version="1.2">
        <types>
          [snipped: see SOLR example schema]
        </types>
        <fields>
          <!-- Fields which are required by Lily -->
          <field name="@@key" type="string" indexed="true" stored="true" required="true"/>
          <field name="@@id" type="string" indexed="true" stored="true" required="true"/>
          <field name="@@vtag" type="string" indexed="true" stored="true" required="true"/>
          <field name="@@versionless" type="string" indexed="true" stored="true" required="false"/>
          <!-- Your own fields -->
          <field name="title" type="text" indexed="true" stored="true" required="false"/>
          <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
        </fields>
        <uniqueKey>@@key</uniqueKey>
        <defaultSearchField>title</defaultSearchField>
        <solrQueryParser defaultOperator="OR"/>
      </schema>

  39. Indexer configuration (Lily)

      <?xml version="1.0"?>
      <indexer xmlns:b="org.lilyproject.bookssample">
        <cases>
          <case recordType="b:Book" variant="*" vtags="last" indexVersionless="true"/>
        </cases>
        <indexFields>
          <indexField name="title">
            <value>
              <field name="b:title"/>
            </value>
          </indexField>
          <indexField name="authors">
            <value>
              <deref>
                <follow field="b:authors"/>
                <field name="b:name"/>
              </deref>
            </value>
          </indexField>
        </indexFields>
      </indexer>
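
The `<deref>` element in the indexer configuration follows the `b:authors` link field and indexes the `b:name` of each linked record. The Python sketch below imitates that denormalization on a made-up in-memory repository; `RECORDS`, the record ids and `index_document` are hypothetical and only illustrate the mapping, not how the Lily Indexer is implemented.

```python
# Made-up repository content: one book linking to two author records.
RECORDS = {
    "author-1": {"b:name": "Author One"},
    "author-2": {"b:name": "Author Two"},
    "book-1": {"b:title": "A Title", "b:authors": ["author-1", "author-2"]},
}


def index_document(record_id, records):
    """Build the index fields 'title' and 'authors' for one Book record."""
    record = records[record_id]
    return {
        # <field name="b:title"/>: copy the value straight from the record
        "title": record.get("b:title"),
        # <deref>: follow each link in b:authors, then read b:name there
        "authors": [
            records[link]["b:name"]
            for link in record.get("b:authors", [])
            if link in records
        ],
    }


doc = index_document("book-1", RECORDS)
```

The resulting document carries the author names directly, so a search on `authors` needs no join at query time.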
  40. (opt.) Sharding configuration

      {
        shardingKey: {
          value: {
            source: "variantProperty",
            property: "language"
          },
          type: "string"
        },
        mapping: {
          type: "list",
          entries: [
            { shard: "shard1", values: ["en", "it"] },
            { shard: "shard2", values: ["nl", "de", "es"] }
          ]
        }
      }
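
Read as data, the sharding configuration maps the `language` variant property of a record to a shard through a list of values. A minimal Python sketch of that lookup follows; `select_shard` is a hypothetical helper, and returning `None` for an unlisted language is an assumption, since the slide does not say what Lily does in that case.

```python
# The sharding configuration from the slide, as a Python dictionary.
SHARDING_CONFIG = {
    "shardingKey": {
        "value": {"source": "variantProperty", "property": "language"},
        "type": "string",
    },
    "mapping": {
        "type": "list",
        "entries": [
            {"shard": "shard1", "values": ["en", "it"]},
            {"shard": "shard2", "values": ["nl", "de", "es"]},
        ],
    },
}


def select_shard(config, variant_properties):
    """Hypothetical lookup: pick the shard whose value list contains the key."""
    property_name = config["shardingKey"]["value"]["property"]
    key = variant_properties.get(property_name)
    for entry in config["mapping"]["entries"]:
        if key in entry["values"]:
            return entry["shard"]
    return None  # assumption: no shard for languages that are not listed


dutch_shard = select_shard(SHARDING_CONFIG, {"language": "nl"})
english_shard = select_shard(SHARDING_CONFIG, {"language": "en"})
```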
  41. Lily API
      » Java (using Avro)
          » http://docs.outerthought.org/lily-docs-current/g3/g1/390-lily.html
      » REST (HTTP + JSON)
          » http://docs.outerthought.org/lily-docs-current/g3/g2/427-lily.html
      » All docs
          » http://docs.outerthought.org/lily-docs-current/ext/toc/
  42. Demo
      » http://outerthought.blip.tv/file/4245615/
  43. Lily and HBase
      » adds high-level content model
          » data types
          » versioning
          » blob storage on HDFS
      » focus on sparse (efficient) storage
      » RowLog for synchronous cross-table updates and async message queues
  44. Lily and SOLR
      » provides flexible mapping between HBase content model and SOLR index fields
      » interactive and batch (M/R) index maintenance
      » sharding
      » use(s) SOLR as-is: loose, flexible, extensible coupling
      » search access via SOLR (HTTP) API
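
Since search goes through the plain SOLR HTTP API, a query is just a URL against the SOLR select handler. The sketch below only builds such a URL and does not send it; the host and port are assumptions for a local SOLR instance, and `solr_select_url` is a hypothetical helper. The `title` field is the one declared in the SOLR schema slide.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for SOLR's /select handler with a JSON response format."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL of a local SOLR instance; adjust to the actual deployment.
url = solr_select_url("http://localhost:8983/solr", "title:lily")
```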
  45. Lily and CDH
      » we intend to rely on CDH-‘blessed’ versions of HBase/HDFS/ZK
          » 700 patches and testing
      » next: adopting similar distribution lay-out
      » since we contribute patches to ASF HBase trunk, we would expect CDH to track closely (until HBase 1.0)
      » some Lily users could be interested in ‘CDH-level’ services
  46. goodbye
      » It’s open source!
      » Content Repository: available now (Lily model + HBase + SOLR + RowLog)
      » Lily 1.0 soon, will mainly focus on differentiating open source and enterprise edition
      » “HBase is wa de max maat.” (Flemish, roughly: “HBase is simply the best, mate.”)
  47. Thank you! For your attention, for your questions
      » stevenn@outerthought.org
      » @stevenn