Content Management Using Hadoop
We live in a world of cloud computing, best-of-breed applications and BYOX (bring your own everything). Companies are opening up to the idea of providing freedom and choice of technology and tools. Freedom to use the tools and applications of one's choice shortens the learning curve and lets people focus on innovation and efficiency. But this freedom comes at a cost: enterprises need strong technology infrastructure and processes to support a variety of applications, tools and platforms while ensuring security, privacy and compliance.
Our experience working with the publishing industry has let us observe this bittersweet truth first hand. In the publishing world, content is generated by many internal and external contributors. In most cases it is impossible to enforce a single content management system or a single ideation-to-publish process, so companies end up with a large amount of content being generated from discrete systems in various formats. The content that accumulates is typically large in volume, unstructured, delivered in waves and inconsistent in format. Learn more about Portals and Content Management Services.
For efficient and consistent publishing of quality content, it is very important to have a set of common formats in place for content and digital asset management. Common formats promote efficiency, modularity, standardization, and reuse of content and other digital assets. Big data platforms like Hadoop can come in handy for publishing firms that want to apply a layer of common formats and processes on top of the large amount of unstructured content they accumulate from discrete systems and individuals. The Hadoop ecosystem provides the technology platform required to handle large volumes of unstructured content and support an enterprise-scale publishing process.
At Jade Global, we have created a reference architecture to support and enhance the publishing process. It is built on the Hadoop ecosystem and draws on our experience working with companies that deal with large amounts of unstructured content from discrete systems. The architecture covers the most commonly sought-after functions of the publishing process, such as aggregation, filtering, curation, classification, indexing, standardization, modularization and workflow. There are many more Hadoop ecosystem components with potential usefulness for content management and publishing, but the reference architecture covers the most commonly used functions. It is also possible to slice out individual ecosystem components and implement each function separately on top of Hadoop Core.

Core Functions of the Reference Architecture:
Aggregate:
Apache Flume agents and sinks are very efficient at collecting unstructured data from discrete systems. In a typical configuration, each source system is assigned a dedicated Flume agent, which is configured to collect data in whatever format that source system is capable of providing. The beauty of Flume is that it supports a variety of formats, so there is no need to change the source systems. At Jade Global, our team can also create custom Flume connectors to collect data from unsupported proprietary systems. The function of the Flume sink is to apply filters to incoming data and store it in the Hadoop Distributed File System. The sink can be used to filter out data that is not needed for the downstream publishing process, or to perform simple transformations before storing the content.
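As a minimal sketch, a Flume agent for one such source system could be configured along the following lines; the agent name, spool directory and HDFS path are placeholders, and a production setup would also tune channel capacity, batching and file rolling.

# Name the components of this agent (the agent name "a1" is arbitrary)
a1.sources = cms1
a1.channels = ch1
a1.sinks = toHdfs

# Watch a drop directory where the source system exports content files
a1.sources.cms1.type = spooldir
a1.sources.cms1.spoolDir = /var/export/cms-drops
a1.sources.cms1.channels = ch1

# Buffer events in memory between source and sink
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 10000

# Land the raw content in HDFS, partitioned by ingest date
a1.sinks.toHdfs.type = hdfs
a1.sinks.toHdfs.channel = ch1
a1.sinks.toHdfs.hdfs.path = /content/raw/%Y/%m/%d
a1.sinks.toHdfs.hdfs.fileType = DataStream
a1.sinks.toHdfs.hdfs.useLocalTimeStamp = true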
Storage:
The Hadoop Distributed File System (HDFS) provides reliable, high-performance storage for structured and unstructured data. Because of its high-performance access and support for unstructured data, HDFS is perfectly suited to storing unstructured content from various source systems. Jade's Hadoop team specializes in installing, administering, configuring and maintaining Hadoop Core components like HDFS.
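As an illustration, a small ingestion utility might land exported content files in HDFS through the standard Java FileSystem API; the namenode address and the /content/raw layout are assumptions for this sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContentLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // cluster address is a placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a locally exported content file into the raw content area
            fs.copyFromLocalFile(new Path("/var/export/article-1234.xml"),
                                 new Path("/content/raw/article-1234.xml"));

            // List what has been landed so far
            for (FileStatus status : fs.listStatus(new Path("/content/raw"))) {
                System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            }
        }
    }
}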
Standardize:
MapReduce is Hadoop's data analysis, manipulation and programming engine. It delivers high-performance data transformation capability with comparatively little programming effort. With its ability to read, analyze and transform large volumes of unstructured data at high speed, MapReduce becomes the powerhouse for standardizing content into the format the enterprise publishing process requires. Jade's specialists have experience developing MapReduce-based standardization processes, including removing unnecessary content (such as CSS styling and HTML tags), converting content from proprietary to industry-standard open formats, consolidating content files by type of content, modularizing content for future reuse, and identifying and cleaning up duplicates. Our passion and drive to explore better ways to transform unstructured data continue to deliver new ways to optimize MapReduce for our clients.
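For the tag-removal case, a simplified, map-only MapReduce job might look like the sketch below; the HDFS paths are placeholders, and the regular expression stands in for the proper HTML parser a production job would use.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StripHtmlJob {

    // Map-only job: emit each input line with HTML tags removed
    public static class StripHtmlMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text cleaned = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String text = value.toString().replaceAll("<[^>]+>", " ");
            cleaned.set(text.trim());
            context.write(NullWritable.get(), cleaned);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "strip-html");
        job.setJarByClass(StripHtmlJob.class);
        job.setMapperClass(StripHtmlMapper.class);
        job.setNumReduceTasks(0);                       // map-only transformation
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/content/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/content/standardized"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}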
Machine Learning:
The Mahout machine learning library is a high-speed, highly scalable learning platform that runs on top of Hadoop. The most common uses of Mahout in the publishing process include automatic classification of content segments, identification of search tags for content segments, and automatic generation of metadata for content. Automatic content classification over large amounts of unstructured data using Mahout can bring huge efficiency and standardization benefits to enterprises.
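As a rough sketch, automatic classification with Mahout's naive Bayes tooling is typically run as a short pipeline of command-line jobs over content already staged in HDFS; the directory names here are placeholders, and the exact flags vary between Mahout versions.

# Convert labeled text files (one directory per category) into SequenceFiles
mahout seqdirectory -i /content/labeled -o /content/seq -ow

# Build TF-IDF feature vectors from the SequenceFiles
mahout seq2sparse -i /content/seq -o /content/vectors -wt tfidf -ow

# Train a (complementary) naive Bayes model, then evaluate it
mahout trainnb -i /content/vectors/tfidf-vectors -el -o /content/model -li /content/labelindex -ow -c
mahout testnb -i /content/vectors/tfidf-vectors -m /content/model -l /content/labelindex -o /content/predictions -ow -c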
Search and Metadata:
As with standardization, MapReduce can run high-speed search indexing and metadata creation jobs over huge amounts of data. At Jade Global, we have devised highly efficient MapReduce-based processes to generate search indexes from various types of open and proprietary sources. We also specialize in automatically identifying custom metadata, driven by each company's requirements, from unstructured, discrete content sources. In addition, we assist our clients in installing, administering, configuring and maintaining HBase to store content metadata and other transactional information. HBase is a Hadoop-based, column-oriented NoSQL database that combines a familiar table-style data model with high scalability and very fast read/write performance.
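As a brief sketch, generated metadata could be written to and read back from an HBase table through the standard Java client API; the table name content_metadata, the meta column family and the row key scheme are assumptions for this example, and the table is presumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ContentMetadataStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("content_metadata"))) {

            // Store metadata for one content item, keyed by content id
            Put put = new Put(Bytes.toBytes("article-1234"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("title"),
                          Bytes.toBytes("Quarterly Market Outlook"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("category"),
                          Bytes.toBytes("finance"));
            table.put(put);

            // Read the metadata back
            Result result = table.get(new Get(Bytes.toBytes("article-1234")));
            String title = Bytes.toString(
                    result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("title")));
            System.out.println("Stored title: " + title);
        }
    }
}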
Advantages of the Reference Architecture for the Publishing Process
1. Freedom and Productivity:
Implementing this reference architecture allows authors and contributors to use the platform of their choice for ideation, authoring and packaging of content. Because the reference architecture includes a standardization process, the organization does not need to compromise on security, privacy or standards compliance while allowing discrete systems to generate content.
2. Common Formats and Processes: The reference architecture is designed to support common publishing processes and formats. With high-speed standardization support, the reference architecture and the Hadoop ecosystem allow organizations to define and enforce best practices and processes for publishing. They also allow for continuous optimization of publishing processes and formats to keep up with changing business and technology needs.
3. Automation:
The reference architecture and the Hadoop ecosystem enable an organization to automate large portions of the content publishing process, with the flexibility of human intervention as needed. Everything from content aggregation, standardization, classification, indexing and search optimization through to open-standard publishing can be automated using the Apache Oozie workflow engine (see the workflow sketch after this list).
4. Open Format Publishing:
This architecture promotes publication of content in open, industry-standard formats, giving the flexibility to publish to multiple platforms such as web, print, mobile, social, or even content resellers. This allows publishing businesses to explore non-traditional revenue streams and innovative ways to deliver content.
5. Time to Market:
Automation, standardization, process focus and high-speed processing of large amounts of data enable businesses to publish content at a fast pace. In today's competitive world of content publishing, every second spent between ideation and publishing is critical to the success of the content, its popularity and the revenue it generates. The reference architecture and the Hadoop ecosystem enable enterprises to achieve best-in-class efficiency in the publishing process.
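To make the automation point concrete, a trimmed-down Oozie workflow might chain the standardization and indexing jobs described above; the action names, driver classes (including the hypothetical BuildSearchIndexJob) and error handling are placeholders, and a real workflow would parameterize paths and add the aggregation and classification steps.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="content-publishing-wf">
    <start to="standardize"/>

    <!-- Run the standardization job (e.g. the HTML-stripping driver sketched earlier) -->
    <action name="standardize">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.StripHtmlJob</main-class>
        </java>
        <ok to="index"/>
        <error to="fail"/>
    </action>

    <!-- Build search indexes and metadata from the standardized content -->
    <action name="index">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.BuildSearchIndexJob</main-class>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Publishing workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>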
Original Source of Blog: https://www.jadeglobal.com/blog/content-management-using-hadoop