Data Quality Powered by Big Data
Enough has been said about the importance of data in an enterprise.
Data has the power to drive decisions, trigger actions, improve efficiency, and
directly impact the bottom line. To realize the true potential of data,
organizations need to make sure their data is accurate, complete, concise,
easily accessible, secure, and consumption ready. In today's highly competitive
environment, companies don't have the luxury of sifting through piles of
spreadsheets and documents; data-driven decisions must be timely to be
effective.
Almost every
organization has multiple data input sources containing overlapping data
attributes for the same entities. For example, information about a customer entity
can flow in through web and mobile self-service, social media outlets, census
and other government data sources, credit agencies, log files, and more. Often,
the information received for a unique customer is conflicting, and just as
often, information about two different customers looks too similar. These
bittersweet problems are usually addressed by Master
Data Management (MDM) software.
Traditional MDM
software licenses are expensive for enterprises. They also have scalability
problems with Big Data and cannot handle unstructured input sources such as
social feeds. This is where big data technology platforms come in handy: they
ensure optimal data quality while adding automation and the discovery of hidden
opportunities within the data.
From our years
of experience working on various data platforms, including ERPs and CRMs, we have
developed a reference architecture for implementing optimal, cost-effective Data
Quality technology using open source big data platforms. The
strength of our reference architecture lies in the scalability and openness of
these platforms. It can scale from smaller data sets to petabytes of data, and
there are no limitations on input formats: using open source ingestion
technologies, this implementation can ingest data from virtually any source in
any format.
Technology
Hadoop: Hadoop core and its ecosystem components are best suited for
ensuring optimal data quality for the growing volume and complexity of data. Hadoop
offers a reliable, scalable, low-cost, high-speed storage and processing
engine, which is essential for data processing needs. Ingestion technologies
like Flume and Sqoop enable Hadoop to collect data from virtually any source,
including databases, cloud applications, social platforms, logs, documents, FTP,
or any other venue for electronic data input. The Hadoop Distributed File System
(HDFS) enables reliable, scalable storage of any form of data with a design
geared toward processing efficiency. MapReduce is Hadoop's processing engine,
delivering high-speed processing of data already stored in HDFS.
These components are well suited for collecting data from discrete sources,
aggregating it, and standardizing it.
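As a minimal sketch of that aggregation step (the tab-delimited layout, field positions, and choice of blocking key are illustrative assumptions, not part of the reference architecture), a Hadoop Streaming job in Python can group incoming records by a normalized key so a reducer can flag potential duplicates:

    #!/usr/bin/env python
    # mapper.py -- emit (normalized_key, raw_record) pairs for Hadoop Streaming.
    # Assumes tab-delimited input: customer_id, name, email (hypothetical layout).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        email = fields[2]
        # Lower-cased email serves as a simple blocking key for matching.
        print("%s\t%s" % (email.strip().lower(), line.rstrip("\n")))

    #!/usr/bin/env python
    # reducer.py -- records sharing a key arrive together; flag groups of 2+.
    import sys

    def flush(key, group):
        if key is not None and len(group) > 1:
            print("DUPLICATE_CANDIDATES\t%s\t%d" % (key, len(group)))

    current_key, group = None, []
    for line in sys.stdin:
        key, _, record = line.rstrip("\n").partition("\t")
        if key != current_key:
            flush(current_key, group)
            current_key, group = key, []
        group.append(record)
    flush(current_key, group)

Submitted through the standard Hadoop Streaming jar (hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...), this pattern scales from small files to petabytes without code changes.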
Spark: Apache Spark is an in-memory computing framework designed to
bring a real-time factor to Big Data analytics. Spark excels at loading data into
memory for complex data processing, delivering lightning-fast results for
complex data exploration, sampling, mining, and analytics. Spark SQL,
Spark's query module, is perfectly suited for ad-hoc data
analysis. Spark also ships with MLlib, a machine learning library that enables
organizations to build predictive models based on historical data.
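For instance, a brief PySpark sketch (the file path, column names, and the specific check below are assumptions chosen for illustration) shows how Spark SQL supports this kind of ad-hoc analysis by surfacing customers whose records disagree on an attribute:

    # Minimal PySpark sketch: ad-hoc data quality checks with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dq-exploration").getOrCreate()

    # Hypothetical customer extract; the path and schema are assumptions.
    customers = spark.read.option("header", "true").csv("hdfs:///dq/customers.csv")
    customers.createOrReplaceTempView("customers")

    # Ad-hoc Spark SQL: customers whose records disagree on mailing address.
    conflicts = spark.sql("""
        SELECT customer_id, COUNT(DISTINCT address) AS address_versions
        FROM customers
        GROUP BY customer_id
        HAVING COUNT(DISTINCT address) > 1
    """)
    conflicts.show()

Because the data stays in memory across these operations, iterating on such queries during exploration remains interactive even at scale.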
Solr: Apache Solr is a high-speed index and search engine designed
around unstructured data. For data quality purposes, Solr can run
matching and cleansing processes using fuzzy-matching algorithms. Depending on
the business rules configured, Solr can automate the duplicate identification and
merging process with minimal or no human intervention.
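As a sketch (the Solr URL, core name, and field name are assumptions), a fuzzy query against Solr's standard select endpoint can surface near-duplicate customer names:

    # Minimal sketch: fuzzy duplicate lookup via Solr's HTTP select API.
    import requests

    SOLR_URL = "http://localhost:8983/solr/customers/select"  # assumed core

    def find_near_duplicates(term, max_edits=2):
        # Lucene fuzzy syntax: term~N matches within an edit distance of N.
        params = {"q": "name:%s~%d" % (term, max_edits), "wt": "json", "rows": 10}
        resp = requests.get(SOLR_URL, params=params)
        resp.raise_for_status()
        return resp.json()["response"]["docs"]

    for doc in find_near_duplicates("jonathan"):
        print(doc.get("id"), doc.get("name"))

Candidates returned this way can then be merged automatically or routed for human review, depending on the configured business rules.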
Hue: Hue is a rich, interactive administration and
reporting dashboard for Hadoop. It offers monitoring, scripting, data
exploration, and dashboard capabilities, and it can integrate Spark and Solr
results as dashboard plugins for centralized access to all data from the various
tools in this reference architecture. Depending on an organization's data
quality needs, we configure Hue to harness the power of the data without
reinventing the wheel. In some cases we have also developed custom user
interfaces for interacting with data using Node.js and Angular.js.
Data Quality
Based on our
years of experience ensuring optimal data quality for large organizations, we
have devised standard processes, components, and tools that give our clients a
head start on automated data quality. We bring our big data
technology and data quality functional expertise together to make data
quality an effortless but tremendously valuable tool for the business.
Data Accuracy: In a world of discrete best-of-breed applications, companies
often deal with numerous data formats. Data standardization helps companies
mine, explore, visualize, dashboard, and monetize data with ease. Our aggregator
adaptors collect data from various source systems and execute
standardization algorithms in real time. The standardization target is determined
by each client as it best suits them, though we can recommend industry-standard
formats based on our experience. We also embed USPS address matching and
cleansing, email address verification, National Change of Address (NCOA)
services, individual demographics (based on public and credit data), and
organization demographics (Dun & Bradstreet data) in our standardization
process. These components allow us to run high-speed, weighted duplicate
identification and merging of duplicate records in near real time on the big
data technology stack.
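To illustrate the idea of weighted duplicate identification (the fields, weights, and threshold here are illustrative assumptions, not our production rules), a minimal scoring function in Python might look like this:

    # Minimal sketch: weighted duplicate scoring between two customer records.
    from difflib import SequenceMatcher

    WEIGHTS = {"name": 0.4, "email": 0.4, "address": 0.2}  # assumed weights
    MATCH_THRESHOLD = 0.85  # assumed cutoff for merge candidates

    def similarity(a, b):
        # Ratio in [0, 1] based on longest matching subsequences.
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def duplicate_score(rec1, rec2):
        # Weighted sum of per-field string similarities.
        return sum(w * similarity(rec1.get(f, ""), rec2.get(f, ""))
                   for f, w in WEIGHTS.items())

    r1 = {"name": "Jon Smith", "email": "jon.smith@example.com",
          "address": "12 Oak St"}
    r2 = {"name": "Jonathan Smith", "email": "jon.smith@example.com",
          "address": "12 Oak Street"}
    print(duplicate_score(r1, r2) >= MATCH_THRESHOLD)  # True -> merge candidate

In production the per-field weights and threshold come from the client's business rules, and the scoring runs across the cluster rather than pairwise on a single machine.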
Data Management: Our data management process imposes a business-focused structure
on large amounts of structured and unstructured data from numerous source
systems. Using our data management process and tools, our clients can
implement layers of security and enforce industry and government compliance
requirements while making data available to the right people at the right time.
Our specialization in data modeling and change management also enables clients to
implement lightweight but efficient data governance. At the end of the day,
technology is only part of what ensures optimal data quality; data management
processes and tools are key to identifying data quality needs and solutions.
Data Discovery: Our data discovery tools allow companies to fill in the blanks,
enabling them to see more dimensions of their historical and transactional
data. We use fuzzy data generation and machine learning algorithms to
generate additional data fields, unlocking the full potential hidden in existing
data. We also use publicly available data sets (such as census data), credit files
(with authorization), demographic information, and web crawlers to generate
additional data fields. Data discovery always brings positive surprises to large
companies as they begin discovering information they never knew they
had.
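As a sketch of the machine-learning side of discovery (the data path, column names, and model choice are assumptions made for illustration), a simple Spark ML pipeline can infer a missing attribute from fields that are already populated:

    # Minimal Spark ML sketch: infer a missing flag from populated attributes.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("dq-discovery").getOrCreate()

    # Hypothetical data set; homeowner_flag is a 0/1 column with gaps.
    df = spark.read.parquet("hdfs:///dq/customers_enriched.parquet")
    labeled = df.filter(df.homeowner_flag.isNotNull())

    # Assemble known numeric attributes into a feature vector.
    assembler = VectorAssembler(
        inputCols=["age", "income", "tenure_years"], outputCol="features")
    train = assembler.transform(labeled).withColumnRenamed(
        "homeowner_flag", "label")

    model = LogisticRegression().fit(train)

    # Score the rows where the attribute is missing to fill in the blanks.
    unlabeled = assembler.transform(df.filter(df.homeowner_flag.isNull()))
    model.transform(unlabeled).select("customer_id", "prediction").show()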
Platform: Our reference architecture for data quality management using
big data technologies comprises open source platforms that fit right into
any enterprise technology footprint without disruption. Our experts specialize
in extending, customizing, installing, configuring, administering, and
implementing these tools for data quality needs. The entire architecture is
designed to be flexible, scalable, high-speed, and cost-efficient. We also offer
a managed service environment for this reference architecture through our
private-cloud offering.
At Jade Global
Inc., we specialize in data quality management using big data platforms. Our
offerings include big data strategy, road-mapping, architecture, business case,
implementation, technical support and managed services.
Original Source of Blog: https://www.jadeglobal.com/blog/data-quality-powered-big-data