    Huguk lily: Presentation Transcript

    • Lily: Smart data at scale
      IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
    • big data, big problems
    • MOORE vs data
      » IDC says the Digital Universe will be 35 Zettabytes by 2020
      » 20% = enterprise data (structured, curated, $$$)
      » Facebook, Yahoo!, Google, Rapleaf, Amazon show us how the remaining 80% can be monetized
        » some of them even rent out their data platform
        » ... at the cost of infrastructure lock-in
      1 Zettabyte = 1,000,000,000,000,000,000,000 bytes, or 1 billion terabytes
      [chart labels: data, moore]
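
The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units, which is what "1 billion terabytes" implies, and the variable names are illustrative.

```python
# Decimal (SI) units are assumed, matching the slide's "1 billion terabytes".
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes

digital_universe_zb = 35   # IDC projection for 2020, as quoted on the slide
enterprise_percent = 20    # share of structured, curated enterprise data

terabytes_per_zettabyte = ZETTABYTE // TERABYTE
enterprise_zb = digital_universe_zb * enterprise_percent // 100
remaining_zb = digital_universe_zb - enterprise_zb

print(f"1 ZB = {terabytes_per_zettabyte:,} TB")
print(f"enterprise data: {enterprise_zb} ZB, remaining data: {remaining_zb} ZB")
```

Under these assumptions, the 80% that is not enterprise data comes to 28 of the projected 35 Zettabytes.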
    • MOORE vs data
      » coping with volume + need for timeliness = parallel processing
      » data becomes business-critical = resilience through distributed architectures
      » Hadoop, MapReduce, HBase: the future data platform
      [chart labels: data, moore]
    • the CHALLENGES
      » process ALL data
      » process data in REAL-TIME
      » derive INSIGHTS
      » provide INSTANT FEEDBACK
    • current thinking
      [diagram labels: data STORE, ETL, data warehouse, analytics]
      batched, off-line, overnight
    • 1. store and manage all YOUR data
      [diagram labels: DATA]
    • 2. store user behaviour, nearby
      [diagram labels: DATA, USER Behavior]
    • 3. analyze usage patterns
      [diagram labels: DATA, USER Behavior, data processing]
    • 4. add domain knowledge
      [diagram labels: DATA, USER Behavior, data processing, domain knowledge (patterns, rules, keywords, lists, ...)]
    • 5. process, in real-time
      [diagram labels: DATA, USER Behavior, data processing, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
    • 6. augment data
      [diagram labels: DATA, USER Behavior, data processing, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
    • data insights
      [diagram labels: SMARTER DATA, data processing, relations, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
    • data insights
      [diagram labels: SMARTER DATA, data processing, relations, recommendations, semantic augmentation, Analytics, domain knowledge (patterns, rules, keywords, lists, ...)]
      SMART DATA, at SCALE ... and in real time
    • stories
    • HYPER-PERSONAL recommendations
      NEWS TOGETHERNESS
      interestingness
      organisations, names, locations, brands
      news aggregator: scale
    • up-selling, CROSS-SELLING
      product CATALOG
      recommendedness, relatedness
      product families, related activities, social graph
      e-retail: real-time
    • competitive innovation
      patents (dis)SIMILARITY
      companies, people, materials, processes
      IP research: insights
    • outerthought
      “The world is moving from content as a cost to data as an opportunity. We provide the tools and the platform to let organisations maximally benefit from the data they grow and collect.”
    • Lily (now)
      » Large-scale content storage, indexing and search
      » Current pilots: e-retail mobile media isp e-gov ip research
      » up to now: 3 man-years investment (since Sept/2009)
    • Lily 1.0 (CR) and Lily 2.0
      [diagram: Lily 1.0 (CR) = data STORE; data STORE + data warehouse + analytics, real time } Lily 2.0]
    • lily USPs
      » Integrated approach, one-stop-data-shop
        » No more flat file processing (Hadoop) ➙ interactive database (HBase) ➡ all data
      » Real-time (vs. overnight)
        » instant feedback loops ➡ real-time
        » designed for on-line, interactive use ➡ easy
      » Available in-house, SaaS possible
      » Data Insights = data + customer retention
    • roadmap
      » now: Lily 0.3
      » april 2011: Lily 1.0
      » Q3 2011
        » real-time statistics + analytics
      » Q2 2012: Lily 2.0
        » real-time data processing engine
        » Data Insights
      » Along the road: Lily SaaS edition
    • open source
      » www.lilyproject.org
      » docs.outerthought.org/lily-docs-current/
    • Lily Core Concepts
      » storage
        » HBase
        » repository model
        » versioning, varianting, mixins
      » indexing
        » mapping
      » search
        » SOLR
    • falling in love with HBase: phase 1
      » automatic scaling to large data sets
      » fault-tolerance
      » flexible datamodel with sparse data
      » commodity hardware
      » efficient random access
      » community-based open source
      » Java if possible
    • falling in love with HBase: phase 2
      » need for consistency
      » atomic single-row updates
      » M/R for index regeneration
    • falling in love with HBase: phase 3
      » datamodel with column families and cell versioning
      » ordered tables with range scans
      » HDFS for blob storage
      » Apache
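
Two of the HBase properties listed above, ordered tables with range scans and cell versioning, can be illustrated with a toy in-memory model. This is a sketch only: `ToyTable`, its methods, and the sample row keys and values are hypothetical, and none of it is the HBase client API.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model: rows kept sorted by key, each cell keeps timestamped versions."""

    def __init__(self):
        self._keys = []  # row keys, always sorted
        self._rows = {}  # row key -> {"family:qualifier": [(timestamp, value), ...]}

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = defaultdict(list)
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest version first

    def get(self, row, column, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        return self._rows.get(row, {}).get(column, [])[:max_versions]

    def scan(self, start, stop):
        """Return row keys in [start, stop), using the sort order of the keys."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]


table = ToyTable()
table.put("book-002", "data:title", "Second title", ts=1)
table.put("book-001", "data:title", "First draft", ts=1)
table.put("book-001", "data:title", "Final title", ts=2)
table.put("author-001", "data:name", "Example Author", ts=1)

book_rows = table.scan("book-", "book-~")  # all rows whose key starts with "book-"
title_history = table.get("book-001", "data:title", max_versions=2)
print(book_rows)
print(title_history)
```

Because the keys are kept sorted, the range scan is a pair of binary searches rather than a full pass over the table, which is the property that makes key design matter in this kind of store.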
    • Lily Repository Model
    • Lily Datatypes
    • Mixins
    • Sample Lily Schema (excerpt)
      {
        namespaces: {
          /* Declaration of namespace prefixes. */
          "org.lilyproject.bookssample": "b",
          "org.lilyproject.vtag": "vtag"
        },
        fieldTypes: [
          {
            name: "b$title",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "b$pages",
            valueType: { primitive: "INTEGER" },
            scope: "versioned"
          },
          {
            name: "b$language",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "b$authors",
            valueType: { primitive: "LINK", multiValue: true },
            scope: "versioned"
          },
          {
            name: "b$name",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "b$bio",
            valueType: { primitive: "STRING" },
            scope: "versioned"
          },
          {
            name: "vtag$last",
            valueType: { primitive: "LONG" },
            scope: "non_versioned"
          }
        ],
        recordTypes: [
          {
            name: "b$Book",
            fields: [
              { name: "b$title", mandatory: true },
              { name: "b$pages", mandatory: false },
              { name: "b$language", mandatory: false },
              { name: "b$authors", mandatory: false },
              { name: "vtag$last", mandatory: false }
            ]
          },
          ...
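
To show what the `mandatory` flags and primitive value types in the schema excerpt imply for a record, here is a minimal validation sketch. It assumes a record is a plain dictionary from field name to value; the `validate` function, the mapping of Lily primitives to Python types, and the sample records are illustrative assumptions, not Lily's own validation code.

```python
# Assumed mapping from the schema's primitive names to Python types.
PRIMITIVES = {"STRING": str, "INTEGER": int, "LONG": int, "LINK": str}

# Field types and the b$Book record type, copied from the schema excerpt.
FIELD_TYPES = {
    "b$title": {"primitive": "STRING"},
    "b$pages": {"primitive": "INTEGER"},
    "b$language": {"primitive": "STRING"},
    "b$authors": {"primitive": "LINK", "multiValue": True},
    "vtag$last": {"primitive": "LONG"},
}
BOOK = {
    "name": "b$Book",
    "fields": [
        {"name": "b$title", "mandatory": True},
        {"name": "b$pages", "mandatory": False},
        {"name": "b$language", "mandatory": False},
        {"name": "b$authors", "mandatory": False},
        {"name": "vtag$last", "mandatory": False},
    ],
}


def validate(record, record_type, field_types):
    """Return a list of problems; an empty list means the record fits the type."""
    errors = []
    declared = {f["name"]: f for f in record_type["fields"]}
    for name, spec in declared.items():
        if spec["mandatory"] and name not in record:
            errors.append(f"missing mandatory field {name}")
    for name, value in record.items():
        if name not in declared:
            errors.append(f"field {name} is not part of {record_type['name']}")
            continue
        value_type = field_types[name]
        if value_type.get("multiValue"):
            if not isinstance(value, list):
                errors.append(f"{name} must be a list")
                continue
            values = value
        else:
            values = [value]
        expected = PRIMITIVES[value_type["primitive"]]
        if not all(isinstance(v, expected) for v in values):
            errors.append(f"{name} is not of type {value_type['primitive']}")
    return errors


# Hypothetical records, for illustration only.
good = validate({"b$title": "An example book", "b$pages": 120}, BOOK, FIELD_TYPES)
bad = validate({"b$pages": "many"}, BOOK, FIELD_TYPES)
print(good)
print(bad)
```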
    • Lily Versioning
    • Flexible content model
      » generic enough to accommodate many popular content schemas
        » HTML5, CMIS, RDF, NewsML, Dublin Core, ...
        » academically verified
        » not limited to ‘content applications’ only
      » developer convenience
        » higher level constructs
        » schema reuse
        » versioning, linking, ...
    • Lily Architecture (deployment)
    • Lily Architecture (components)
    • HBase RowLog Library
      » need for sync/async operations
        » updating of secondary indexes (i.e. tables)
        » feeding of Indexer (= bridge to SOLR index maintenance)
      » not: transactions
      » need for distribution and durability
    • HBase RowLog Library
      » WAL
        » guaranteed execution of synchronous actions
        » call doesn’t return before secondary action finishes
        » e.g. update secondary index tables
        » if all goes well, size = #concurrent ops
      » Queue
        » triggering of async actions
        » e.g. (re)index (updated) record with SOLR back-end
        » size depends on speed of back-end process
      » useful outside of Lily context as well!
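
The split between the WAL and the queue described above can be sketched with a small in-memory model. This is an illustration of the idea only, not the RowLog library: `ToyRowLog`, its methods, and the listener callbacks are hypothetical names, and the real library persists its state in HBase rather than in Python objects.

```python
from collections import deque


class ToyRowLog:
    """Toy model of a write-ahead log for sync actions plus a queue for async ones."""

    def __init__(self):
        self.wal = {}              # seq -> message whose sync actions are unfinished
        self.queue = deque()       # messages waiting for asynchronous processing
        self.sync_listeners = []   # e.g. secondary index table updates
        self.async_listeners = []  # e.g. (re)indexing in a search back-end
        self._seq = 0

    def put(self, row_key, payload):
        """Log the message first, then run sync actions before returning."""
        self._seq += 1
        msg = (self._seq, row_key, payload)
        self.wal[msg[0]] = msg
        self._finish(msg)
        return msg[0]

    def _finish(self, msg):
        for listener in self.sync_listeners:
            listener(msg)          # if this raises, the entry stays in the WAL
        self.queue.append(msg)     # hand over to the async side
        del self.wal[msg[0]]       # sync work done, WAL entry no longer needed

    def replay(self):
        """Retry unfinished entries; sync listeners should therefore be idempotent."""
        for seq in sorted(self.wal):
            self._finish(self.wal[seq])

    def process_queue(self):
        """Drain the queue, as a background worker would; returns messages handled."""
        handled = 0
        while self.queue:
            msg = self.queue.popleft()
            for listener in self.async_listeners:
                listener(msg)
            handled += 1
        return handled


secondary_index = {}  # title -> row key, updated synchronously
indexed_rows = []     # rows sent to the search back-end, updated asynchronously

log = ToyRowLog()
log.sync_listeners.append(lambda m: secondary_index.__setitem__(m[2]["title"], m[1]))
log.async_listeners.append(lambda m: indexed_rows.append(m[1]))

log.put("book-001", {"title": "An example book"})
state_after_put = (dict(secondary_index), len(log.wal), len(log.queue), list(indexed_rows))
handled = log.process_queue()
print(state_after_put, handled, indexed_rows)
```

In this model the WAL stays empty as long as every synchronous action succeeds, while the queue length reflects how far the asynchronous consumer lags behind, which mirrors the two size remarks on the slide.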
    • The Lily Indexer
      » denormalization
      » batch index building
      » sharding towards multiple SOLR instances
      » indexing of multiple versions of a record
      » incremental index updating
      » blob content extraction
    • Indexing configuration (SOLR)
      <schema name="example" version="1.2">
        <types>
          [snipped: see SOLR example schema]
        </types>
        <fields>
          <!-- Fields which are required by Lily -->
          <field name="@@key" type="string" indexed="true" stored="true" required="true"/>
          <field name="@@id" type="string" indexed="true" stored="true" required="true"/>
          <field name="@@vtag" type="string" indexed="true" stored="true" required="true"/>
          <field name="@@versionless" type="string" indexed="true" stored="true" required="false"/>
          <!-- Your own fields -->
          <field name="title" type="text" indexed="true" stored="true" required="false"/>
          <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
        </fields>
        <uniqueKey>@@key</uniqueKey>
        <defaultSearchField>title</defaultSearchField>
        <solrQueryParser defaultOperator="OR"/>
      </schema>
    • Indexer configuration (Lily)
      <?xml version="1.0"?>
      <indexer xmlns:b="org.lilyproject.bookssample">
        <cases>
          <case recordType="b:Book" variant="*" vtags="last" indexVersionless="true"/>
        </cases>
        <indexFields>
          <indexField name="title">
            <value>
              <field name="b:title"/>
            </value>
          </indexField>
          <indexField name="authors">
            <value>
              <deref>
                <follow field="b:authors"/>
                <field name="b:name"/>
              </deref>
            </value>
          </indexField>
        </indexFields>
      </indexer>
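
The `deref`/`follow` construct in the indexer configuration above amounts to following the link field `b:authors` and copying `b:name` from each linked record into the index document. A minimal sketch of that denormalization step follows; the in-memory `RECORDS` store, the `b:Author` record type name, the `build_document` function, and all sample values are assumptions made for illustration, and the Lily-specific `@@` fields are left out.

```python
# Hypothetical in-memory repository: record id -> record type and fields.
RECORDS = {
    "author-1": {"type": "b:Author", "fields": {"b:name": "Jane Example"}},
    "author-2": {"type": "b:Author", "fields": {"b:name": "John Sample"}},
    "book-1": {
        "type": "b:Book",
        "fields": {"b:title": "An example book", "b:authors": ["author-1", "author-2"]},
    },
}

# The two indexField entries from the configuration, as plain data.
INDEX_FIELDS = [
    {"name": "title", "field": "b:title"},
    {"name": "authors", "follow": "b:authors", "field": "b:name"},
]


def build_document(record_id, records, index_fields, record_type="b:Book"):
    """Build an index document for one record, or None if no case matches."""
    record = records[record_id]
    if record["type"] != record_type:
        return None
    doc = {}
    for index_field in index_fields:
        if "follow" in index_field:
            # deref: follow each link and take the named field of the target record
            values = []
            for link in record["fields"].get(index_field["follow"], []):
                target = records.get(link)
                if target and index_field["field"] in target["fields"]:
                    values.append(target["fields"][index_field["field"]])
            if values:
                doc[index_field["name"]] = values
        elif index_field["field"] in record["fields"]:
            doc[index_field["name"]] = record["fields"][index_field["field"]]
    return doc


book_doc = build_document("book-1", RECORDS, INDEX_FIELDS)
author_doc = build_document("author-1", RECORDS, INDEX_FIELDS)
print(book_doc)
print(author_doc)
```

One consequence of this denormalization is that a change to an author record makes the index entries of the linked books stale, so those books need reindexing as well.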
    • (opt.) Sharding configuration
      {
        shardingKey: {
          value: {
            source: "variantProperty",
            property: "language"
          },
          type: "string"
        },
        mapping: {
          type: "list",
          entries: [
            { shard: "shard1", values: ["en", "it"] },
            { shard: "shard2", values: ["nl", "de", "es"] }
          ]
        }
      }
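
Read as data, the sharding configuration above says: take the `language` variant property of a record and look it up in a list of shard entries. The sketch below applies that rule; `select_shard` is a hypothetical helper, and raising an error for a value that matches no entry is an assumption, since the slide does not say what Lily does in that case.

```python
# The sharding configuration from the slide, as a Python dictionary.
SHARDING = {
    "shardingKey": {
        "value": {"source": "variantProperty", "property": "language"},
        "type": "string",
    },
    "mapping": {
        "type": "list",
        "entries": [
            {"shard": "shard1", "values": ["en", "it"]},
            {"shard": "shard2", "values": ["nl", "de", "es"]},
        ],
    },
}


def select_shard(variant_properties, config):
    """Pick the shard whose value list contains the record's sharding key."""
    key_conf = config["shardingKey"]["value"]
    if key_conf["source"] != "variantProperty":
        raise ValueError("this sketch only handles variantProperty sources")
    key = variant_properties[key_conf["property"]]
    for entry in config["mapping"]["entries"]:
        if key in entry["values"]:
            return entry["shard"]
    # Assumption: an unmapped key is treated as an error in this sketch.
    raise LookupError(f"no shard configured for {key!r}")


english_shard = select_shard({"language": "en"}, SHARDING)
dutch_shard = select_shard({"language": "nl"}, SHARDING)
print(english_shard, dutch_shard)
```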
    • Lily API
      » Java (using Avro)
        » http://docs.outerthought.org/lily-docs-current/g3/g1/390-lily.html
      » REST (HTTP + JSON)
        » http://docs.outerthought.org/lily-docs-current/g3/g2/427-lily.html
      » All docs
        » http://docs.outerthought.org/lily-docs-current/ext/toc/
    • Demo
      » http://outerthought.blip.tv/file/4245615/
    • Lily and HBase
      » adds high-level content model
        » data types
        » versioning
        » blob storage on HDFS
      » focus on sparse (efficient) storage
      » RowLog for synchronous cross-table updates and async message queues
    • Lily and SOLR
      » provides flexible mapping between HBase content model and SOLR index fields
      » interactive and batch (M/R) index maintenance
      » sharding
      » use(s) SOLR as-is: loose, flexible, extensible coupling
      » search access via SOLR (HTTP) API
    • Lily and CDH
      » we intend to rely on CDH-‘blessed’ versions of HBase/HDFS/ZK
        » 700 patches and testing
      » next: adopting similar distribution layout
      » since we contribute patches to ASF HBase trunk, we would expect CDH to track closely (until HBase 1.0)
      » some Lily users could be interested in ‘CDH-level’ services
    • goodbye
      » It’s open source!
      » Content Repository: available now (Lily model + HBase + SOLR + RowLog)
      » Lily 1.0 soon, will mainly focus on differentiating open source and enterprise edition
      » “HBase is wa de max maat.” (Flemish, roughly: “HBase is pretty awesome, mate.”)
    • Thank you!
      for your attention
      for your questions
      » stevenn@outerthought.org
      » @stevenn