Computer science is like an enormous toolbox you can rummage
through whenever you have a problem to solve. Most of the tools
are sturdy and practical, like algorithms for B-trees. Some are
also elegant, like consistent hashing in Dynamo. Finally there
are some tools that you never quite figure out even after years
of reflection. That piece of steel you are looking at could be
Excalibur. Or it could be a rusty knife.
The CAP theorem falls into the last category, at least
for me. It was a major topic in the blogosphere a few years
ago and Google Trends shows steadily increasing interest in the term since
2010. It's not my goal to explain CAP fully; a good
informal description is …
In my previous post I pointed out that the existing ARCHIVE
storage engine in MySQL may not satisfy your needs when it comes
to storing large and/or old data efficiently. But are there any
good alternatives? As the primary purpose of this engine is to
store rarely accessed data in a disk-space-efficient way, I will
focus here on its data compression abilities rather than on
performance.
The InnoDB engine provides a compressed row format, but is its efficiency anywhere close to that of the ARCHIVE engine? You can also compress MyISAM tables with the myisampack tool, but that means the table becomes read-only after the operation.
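For a concrete sense of what those two options look like, here is a minimal sketch (table and column names are made up for illustration; on MySQL versions of that era, InnoDB compression also requires innodb_file_per_table=ON and innodb_file_format=Barracuda):

    -- InnoDB compressed row format: pick a KEY_BLOCK_SIZE; smaller values
    -- (4, 2, 1) trade more CPU for potentially better compression.
    CREATE TABLE archive_candidate (
      id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      logged_at DATETIME NOT NULL,
      payload TEXT
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

    -- MyISAM compression is an offline step done with the myisampack tool
    -- (shell commands, not SQL), after which the table is read-only:
    --   myisampack /path/to/datadir/db/archive_candidate.MYI
    --   myisamchk -rq /path/to/datadir/db/archive_candidate.MYI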
Moreover, I trust neither MyISAM nor ARCHIVE when it comes to data durability. Fortunately, along came a quite new (open source since April …
SAP HANA, having entered the data 2.0/3.0 space at the right time, has been getting traction lately, and there will be a lot of users like me who want to[...]
As this topic came up for discussion a few times this week in various places, I thought of composing a post on “Data Scientist vs. Data Analytics Engineer”, even though[...]
Apart from my consulting as part of ScaleIn, I also invest in bootstrapping companies with really disruptive ideas, and in the process I have met a few database-specific companies who are already[...]
Here is the typical “Big” data architecture, which covers most of the components involved in the data pipeline. More or less, we have the same architecture in production in a number of places[...]
As described in the first article of this series, Tungsten
Replicator can replicate data from MySQL to Vertica in real time.
We use a new batch loading feature that applies
transactions to data warehouses in very large blocks using COPY
or LOAD DATA INFILE commands. This second and concluding
article walks through the details of setting up and testing MySQL
to Vertica replication.
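As a rough illustration of what those bulk statements look like (the file paths, table names, and CSV options below are hypothetical; the real ones are generated by the replicator's batch applier):

    -- On a MySQL-based warehouse, a large block of transactions could be
    -- applied from a staging CSV file in a single statement:
    LOAD DATA INFILE '/tmp/staging/batch-000123.csv'
    INTO TABLE stage_orders
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    -- On Vertica, the equivalent bulk load uses COPY; DIRECT writes straight
    -- to disk storage (ROS) rather than the in-memory WOS:
    COPY stage_orders FROM '/tmp/staging/batch-000123.csv'
    DELIMITER ',' NULL '' DIRECT;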
To keep the article reasonably short, I assume that readers are
conversant with MySQL, Tungsten, and Vertica. Basic
replication setup is not hard if you follow all the steps
described here, but of course there are variations in every
setup. For more information on Tungsten, check out the
Tungsten Replicator project at the code.google.com
site as well as …
Real-time analytics allow companies to react rapidly to changing
business conditions. Online ad services process
click-through data to maximize ad impressions. Retailers
analyze sales patterns to identify micro-trends and move
inventory to meet them. The common theme is speed: moving
lots of information without delay from operational systems to
fast data warehouses that can feed reports back to users as
quickly as possible.
Real-time data publishing is a classic example of a big
data replication problem. In this two-part article
I will describe recent work on Tungsten Replicator to move data out of MySQL
into Vertica at high speed with minimal load on
DBMS servers. This feature …
If you're in the Los Angeles area on Feb 15, come hear my talk at
LAMySQL, inspired by lessons learned from real-life experiences.
In addition to hearing a unique and interesting talk, you can
win an Apple TV thanks to the awesome folks at @NoodleYard.
Real-Life Use Cases From Data Administration
Hell
Data is the most valuable asset of an organization because it's
irreplaceable.
Yet we hear every day about f**k ups related to data
administration from startups and organizations of all sizes.
Sometimes it's no one's fault. Sometimes it's the fault of a
drunk friend who shouldn't have been [wherever he was] in the
first place.
Yet, at other times, the disaster could have been prevented.
Sometimes, these f**k ups are caused by bad design. Sometimes,
it's a bad …
Googling around, I came across Bradford Cross' article, Big Data Is Less About Size, And More About
Freedom. Bradford writes, "The scale of data and
computations is an important issue, but the data age is less
about the raw size of your data, and more about the cool stuff
you can do with it."
Even though the article makes some good points, I'm not sure I
can agree with Bradford's point of view here. As an architect,
when I think in terms of Big Data, the ability to do "cool stuff"
is probably the last thing that crosses my mind. Big Data, to me,
is about ensuring constant response time as the data grows in
size without sacrificing functionality.
What do you think Big Data is about? Is it merely about being
able to do 'cool stuff' with your data? Is it about ensuring
constant access/response times? Or is it about something else?
I'm eager …